DiffIR: Efficient Diffusion Model for Image Restoration

Paper link: https://arxiv.org/abs/2303.09472

Project link: https://github.com/Zj-BinXia/DiffIR

Abstract

The Diffusion Model (DM) achieves SOTA performance by modeling the image synthesis process as a sequential application of a denoising network. However, unlike image synthesis, which generates every pixel from scratch, image restoration (IR) starts from an input in which most pixels are given. Therefore, for IR it is inefficient for traditional DMs to run a large number of iterations on a large model to estimate the entire image or feature map. To address this issue, we propose an efficient DM for IR (DiffIR), which consists of a compact IR prior extraction network (CPEN), a dynamic IR transformer (DIRformer), and a denoising network. Specifically, DiffIR has two training stages: pre-training and training the DM. In pre-training, we feed ground-truth images into CPEN-S1 to capture a compact IR prior representation (IPR) that guides DIRformer. In the second stage, we train the DM to directly estimate the same IPR as the pretrained CPEN-S1 using only LQ images. We observe that since the IPR is just a compact vector, DiffIR can use far fewer iterations than traditional DM to obtain accurate estimates and produce more stable and realistic results. Owing to the small number of iterations, our DiffIR can adopt joint optimization of CPEN-S2, DIRformer, and the denoising network, which further reduces the influence of estimation error. We conduct extensive experiments on several IR tasks and achieve SOTA performance while consuming less computational cost.

1. Introduction

Image restoration has been a long-standing problem due to its wide applicability and ill-posed nature. The purpose of IR is to restore a high-quality (HQ) image from its low-quality (LQ) counterpart corrupted by various degradation factors (e.g., blurring, masking, downsampling). Currently, deep learning-based IR methods have achieved impressive success because they can learn strong priors from large-scale datasets.

More recently, Diffusion Models (DM) [49], built from a hierarchy of denoising autoencoders, have achieved impressive results in image synthesis [21, 50, 10, 22] and in IR tasks such as inpainting [37, 45] and super-resolution [47]. Specifically, a DM is trained to iteratively denoise images through a reverse diffusion process. DM has shown that principled probabilistic diffusion modeling can achieve a high-quality mapping from randomly sampled Gaussian noise to complex target distributions, such as real images or latent distributions [45], without suffering from the mode collapse and training instability of GANs.

As a class of likelihood-based models, DMs require a large number of iterative steps (about 50-1000) of a large denoising model to capture precise details of the data distribution, which consumes considerable computing resources. Unlike image synthesis, where each pixel is generated from scratch, IR only needs to add accurate details to a given LQ image. Therefore, if IR directly adopts the DM paradigm of image synthesis, it not only wastes a lot of computing resources but also tends to produce details that do not match the given LQ image.

This paper aims to design a DM-based IR network that fully and effectively utilizes the powerful distribution mapping capability of DM to restore images. To this end, we propose DiffIR. Since transformers can model long-range pixel dependencies, we adopt transformer blocks as the basic unit of DiffIR. We stack the transformer blocks in a UNet shape to form a dynamic IR transformer (DIRformer) that extracts and aggregates multi-level features. We train DiffIR in two stages:

(1) In the first stage (Fig. 2(a)), we develop a compact IR prior extraction network (CPEN) to extract a compact IR prior representation (IPR) from ground-truth images to guide DIRformer. In addition, we develop the dynamic gated feed-forward network (DGFN) and dynamic multi-head transposed attention (DMTA) to take full advantage of the IPR. It is worth noting that CPEN and DIRformer are optimized together.

(2) In the second stage (Fig. 2(b)), we train the DM to estimate an accurate IPR directly from LQ images. Since the IPR is lightweight and only adds details for restoration, our DM can estimate a fairly accurate IPR and achieve stable visual results with only a few iterations.

In addition to the novelty of the scheme and architecture described above, we also demonstrate the effectiveness of joint optimization. In the second stage, we observe that the estimated IPR may still have a small error, which affects the performance of DIRformer. However, previous DMs require many iterations, which makes joint optimization of the DM with the decoder infeasible. Since our DiffIR requires very few iterations, we can run all iterations, obtain the estimated IPR, and optimize it jointly with DIRformer. As shown in Figure 1, our DiffIR achieves SOTA performance and consumes much less computation than other DM-based methods such as RePaint [37] and LDM [45]. In particular, DiffIR is about 1000 times more efficient than RePaint. Our main contributions are threefold:

  1. We propose DiffIR, a powerful, simple, and efficient DM-based IR baseline. Unlike image synthesis, most pixels of the IR input image are given. Therefore, we exploit the powerful mapping ability of DM to estimate a compact IPR to guide IR, thereby improving the efficiency and stability of DM for IR.
  2. We propose DMTA and DGFN for DIRformer to make full use of the IPR. Unlike previous latent DMs that optimize the denoising network alone, we propose joint optimization of the denoising network and the decoder (i.e., DIRformer) to further improve robustness to estimation error.
  3. Extensive experiments demonstrate that the proposed DiffIR can achieve SOTA performance in IR tasks while consuming much less computational resources compared to other DM-based methods.


2. Related Work

Image restoration. As pioneering works, SRCNN [13], DnCNN [79], and ARCNN [12] achieve impressive performance on IR with compact CNNs. Since then, CNN-based methods have become more popular than traditional IR methods. Researchers have studied CNNs from different perspectives, resulting in more elaborate network architecture designs and learning schemes, such as residual blocks [27, 76, 6], GANs [19, 60, 44], attention [81, 61, 9, 67, 66, 63, 68], knowledge distillation [62], and others [24, 17, 28, 16, 71].

In recent years, the transformer, originally developed for natural language processing, has gained a lot of attention in the computer vision community. Compared with CNNs, transformers can model global interactions between different regions and achieve state-of-the-art performance. Transformers have been widely used in vision tasks such as image recognition [15, 55], image segmentation [57, 64, 82], object detection [5, 84], and image restoration [7, 35, 69, 33].

Diffusion model. Diffusion Models (DM) [21] have achieved state-of-the-art results in density estimation [29] and sample quality [10]. DM uses a parameterized Markov chain to optimize a variational lower bound on the likelihood, so that the generated target distribution is more accurate than that of generative models such as GANs. In recent years, DM has become increasingly influential in image restoration tasks such as super-resolution [26, 47] and image inpainting [37, 45]. SR3 [47] introduces DM into image super-resolution and achieves better performance than SOTA GAN-based methods. Palette [46] proposes a conditional diffusion model for IR, inspired by conditional generative models [40]. LDM [45] performs DM in latent space to improve restoration efficiency. RePaint [37] designs an improved denoising strategy through resampling iterations in DM. However, these DM-based IR methods directly adopt the DM paradigm of image synthesis, performing DM on the whole image or feature map, even though most pixels in IR are already given. Our DiffIR instead performs DM on a compact IPR, which makes the DM process more efficient and stable.

3. Preliminaries: Diffusion Models

In this paper, we employ diffusion models (DM) [21] to generate accurate IR prior representations (IPR). During training, the DM method defines a diffusion process that converts the input image $x_0$ into Gaussian noise $x_T \sim \mathcal{N}(0, \mathbf{I})$ over $T$ iterations. Each iteration of the diffusion process can be described as:

$$q\left(x_{t}\mid x_{t-1}\right)=\mathcal{N}\left(x_{t};\sqrt{1-\beta_{t}}\,x_{t-1},\beta_{t}\mathbf{I}\right), \tag{1}$$

where $x_t$ is the noisy image at time step $t$, $\beta_t$ is a predefined scaling factor, and $\mathcal{N}$ denotes the Gaussian distribution. Equation (1) can be further simplified as:

$$q\left(x_{t}\mid x_{0}\right)=\mathcal{N}\left(x_{t};\sqrt{\bar{\alpha}_{t}}\,x_{0},\left(1-\bar{\alpha}_{t}\right)\mathbf{I}\right), \tag{2}$$

where $\alpha_{t}=1-\beta_{t}$ and $\bar{\alpha}_{t}=\prod_{i=0}^{t}\alpha_{i}$.

In the inference phase (the reverse process), the DM method samples a Gaussian random noise map $x_T$ and then denoises $x_T$ step by step until it obtains the high-quality output $x_0$:

$$p\left(x_{t-1}\mid x_{t},x_{0}\right)=\mathcal{N}\left(x_{t-1};\mu_{t}\left(x_{t},x_{0}\right),\sigma_{t}^{2}\mathbf{I}\right), \tag{3}$$

where the mean is $\mu_{t}\left(x_{t},x_{0}\right)=\frac{1}{\sqrt{\alpha_{t}}}\left(x_{t}-\epsilon\,\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\right)$ and the variance is $\sigma_{t}^{2}=\frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_{t}}\beta_{t}$. Here $\epsilon$ denotes the noise in $x_t$ and is the only uncertain variable in the reverse process. DM uses a denoising network $\epsilon_{\theta}(x_t, t)$ to estimate $\epsilon$. To train $\epsilon_{\theta}(x_t, t)$, given a clean image $x_0$, DM randomly samples a time step $t$ and a noise $\epsilon\sim\mathcal{N}(0,\mathbf{I})$ to generate a noisy image $x_t$ according to Eq. (2). Then, DM optimizes the parameters $\theta$ of $\epsilon_{\theta}$ following [21]:

$$\nabla_{\theta}\left\|\epsilon-\epsilon_{\theta}\left(\sqrt{\bar{\alpha}_{t}}\,x_{0}+\epsilon\sqrt{1-\bar{\alpha}_{t}},\,t\right)\right\|_{2}^{2}. \tag{4}$$
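
To make the training objective above concrete, here is a minimal PyTorch-style sketch of one DM training step (Eqs. (2) and (4)); `eps_model` and the `betas` schedule are hypothetical placeholders for a generic noise predictor, not the authors' released code:

```python
import torch
import torch.nn.functional as F

def ddpm_training_step(eps_model, x0, betas):
    """One generic DM training step: sample t, noise x0 via Eq. (2), regress the noise (Eq. (4))."""
    alphas = 1.0 - betas                      # alpha_t = 1 - beta_t
    alpha_bar = torch.cumprod(alphas, dim=0)  # cumulative product up to each time step

    b = x0.shape[0]
    t = torch.randint(0, betas.shape[0], (b,), device=x0.device)  # random time step per sample
    eps = torch.randn_like(x0)                                    # epsilon ~ N(0, I)

    a_bar_t = alpha_bar[t].view(b, 1, 1, 1)
    x_t = a_bar_t.sqrt() * x0 + (1.0 - a_bar_t).sqrt() * eps      # sampled form of Eq. (2)

    eps_pred = eps_model(x_t, t)                                  # denoising network
    return F.mse_loss(eps_pred, eps)                              # squared error of Eq. (4)
```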

4. Methodology

Traditional DMs [49, 45, 37] require a large number of iterations, computational resources, and model parameters to generate accurate and realistic images or latent feature maps. Although DM has achieved impressive performance in generating images from scratch (image synthesis), directly applying this paradigm to IR wastes computational resources. Since most pixels and information in IR are given, performing DM on the entire image or feature map not only requires many iterations and computations, but also tends to produce artifacts. In short, DM has strong data estimation capabilities, but applying existing image-synthesis DMs to IR is inefficient. To address this issue, we propose an efficient IR diffusion model (DiffIR), which employs the diffusion model to estimate a compact IPR that guides the network to restore images. Since the IPR is quite lightweight, the model size and iteration count of DiffIR can be greatly reduced compared to traditional DM, resulting in more accurate estimates.


In this section, we introduce DiffIR. As shown in Figure 2, DiffIR mainly consists of a compact IR prior extraction network (CPEN), a dynamic IR transformer (DIRformer), and a denoising network. We train DiffIR in two stages: pre-training DiffIR and training the diffusion model. In the following sections, we first introduce the pre-training of DiffIR in Section 4.1, and then provide the details of DiffIR's efficient DM training in Section 4.2.

4.1. Pretrain DiffIR

Before introducing the pre-training of DiffIR, we first describe the two networks used in the first stage: the compact IR prior extraction network (CPEN) and the dynamic IRformer (DIRformer). The structure of CPEN is shown in the yellow box in Figure 2: it mainly consists of stacked residual blocks and linear layers that extract a compact IR prior representation (IPR). DIRformer can then use the extracted IPR to restore the LQ image. The structure of DIRformer is shown in the pink box in Figure 2: it stacks dynamic transformer blocks in a UNet shape. Each dynamic transformer block consists of dynamic multi-head transposed attention (DMTA, green box in Figure 2) and a dynamic gated feed-forward network (DGFN, blue box in Figure 2), which use the IPR as dynamic modulation parameters to add restoration details into the feature maps.

In pre-training (Fig. 2(a)), we train CPEN-S1 together with DIRformer. Specifically, we first concatenate the ground-truth and LQ images and downsample them using the PixelUnshuffle operation to obtain the input of CPEN-S1. Then, CPEN-S1 extracts the IPR $\mathbf{Z}\in\mathbb{R}^{4C'}$ as:

$$\mathbf{Z}=\mathrm{CPEN}_{\mathrm{S1}}(\mathrm{PixelUnshuffle}(\mathrm{Concat}(I_{GT},I_{LQ}))). \tag{5}$$

The IPR $\mathbf{Z}$ is then sent to the DMTA and DGFN modules of DIRformer as a dynamic modulation parameter to guide restoration:

$$\mathbf{F}^{\prime}=W_{l}^{1}\mathbf{Z}\odot\mathrm{Norm}(\mathbf{F})+W_{l}^{2}\mathbf{Z}, \tag{6}$$

where $\odot$ denotes element-wise multiplication, Norm denotes layer normalization [2], $W_l$ denotes a linear layer, $\mathbf{F}$ and $\mathbf{F}^{\prime} \in \mathbb{R}^{\hat{H} \times \hat{W} \times \hat{C}}$ are the input and output feature maps, respectively, and $W_l^1 \mathbf{Z}, W_l^2 \mathbf{Z} \in \mathbb{R}^{\hat{C}}$.
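
To illustrate how Eq. (6) injects the IPR into DIRformer, the snippet below is a rough sketch of the modulation step; the module name, channels-last layout, and layer names are our own assumptions for illustration, not the official implementation:

```python
import torch
import torch.nn as nn

class IPRModulation(nn.Module):
    """Sketch of Eq. (6): F' = (W_l^1 Z) * Norm(F) + (W_l^2 Z)."""
    def __init__(self, ipr_dim: int, channels: int):
        super().__init__()
        self.norm = nn.LayerNorm(channels)
        self.to_scale = nn.Linear(ipr_dim, channels)   # W_l^1
        self.to_shift = nn.Linear(ipr_dim, channels)   # W_l^2

    def forward(self, feat: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # feat: (B, H, W, C) feature map; z: (B, 4C') IPR vector
        scale = self.to_scale(z)[:, None, None, :]     # broadcast over spatial dimensions
        shift = self.to_shift(z)[:, None, None, :]
        return scale * self.norm(feat) + shift
```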

Next, we aggregate global spatial information in DMTA. Specifically, $\mathbf{F}^{\prime}$ is projected into the query $\mathbf{Q}=W_d^Q W_c^Q \mathbf{F}^{\prime}$, key $\mathbf{K}=W_d^K W_c^K \mathbf{F}^{\prime}$, and value $\mathbf{V}=W_d^V W_c^V \mathbf{F}^{\prime}$, where $W_c$ is a $1\times1$ point-wise convolution and $W_d$ is a $3\times3$ depth-wise convolution. Next, we reshape the query to $\hat{\mathbf{Q}} \in \mathbb{R}^{\hat{H}\hat{W} \times \hat{C}}$, the key to $\hat{\mathbf{K}} \in \mathbb{R}^{\hat{C} \times \hat{H}\hat{W}}$, and the value to $\hat{\mathbf{V}} \in \mathbb{R}^{\hat{H}\hat{W} \times \hat{C}}$. After that, the dot product between $\hat{\mathbf{K}}$ and $\hat{\mathbf{Q}}$ generates a transposed attention map $\mathbf{A}$ of size $\mathbb{R}^{\hat{C} \times \hat{C}}$, which is more efficient than a regular attention map of size $\mathbb{R}^{\hat{H}\hat{W} \times \hat{H}\hat{W}}$. The whole process of DMTA can be described as:

$$\hat{\mathbf{F}}=W_c \hat{\mathbf{V}} \cdot \operatorname{Softmax}(\hat{\mathbf{K}}\cdot \hat{\mathbf{Q}}/\gamma)+\mathbf{F}, \tag{7}$$

where $\gamma$ is a learnable scaling parameter. Following conventional multi-head self-attention [15, 7], we split the channels into multiple heads and compute separate attention maps.
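
For intuition, a simplified single-head sketch of the transposed (channel-wise) attention in Eq. (7) is given below; layer names and shapes are assumptions for illustration, and the point is only that the attention map is $\hat{C}\times\hat{C}$ rather than $\hat{H}\hat{W}\times\hat{H}\hat{W}$:

```python
import torch
import torch.nn as nn

class TransposedAttention(nn.Module):
    """Simplified single-head sketch of Eq. (7): attention computed across channels."""
    def __init__(self, channels: int):
        super().__init__()
        self.qkv = nn.Conv2d(channels, channels * 3, kernel_size=1)                             # W_c projections
        self.dwconv = nn.Conv2d(channels * 3, channels * 3, 3, padding=1, groups=channels * 3)  # W_d depth-wise convs
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)                                # output projection
        self.gamma = nn.Parameter(torch.ones(1))                                                # learnable scaling

    def forward(self, feat_mod: torch.Tensor, feat_in: torch.Tensor) -> torch.Tensor:
        # feat_mod: modulated features F' (B, C, H, W); feat_in: block input F for the residual
        b, c, h, w = feat_mod.shape
        q, k, v = self.dwconv(self.qkv(feat_mod)).chunk(3, dim=1)
        q_hat = q.flatten(2).transpose(1, 2)                                 # (B, HW, C)
        k_hat = k.flatten(2)                                                 # (B, C, HW)
        v_hat = v.flatten(2).transpose(1, 2)                                 # (B, HW, C)
        attn = torch.softmax(torch.bmm(k_hat, q_hat) / self.gamma, dim=-1)   # (B, C, C) transposed attention map
        out = torch.bmm(v_hat, attn).transpose(1, 2).view(b, c, h, w)        # back to (B, C, H, W)
        return self.proj(out) + feat_in                                      # Eq. (7): W_c V_hat . Softmax(...) + F
```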

Next, in DGFN, we aggregate local features. We use 1×1 convolution to aggregate information from different channels, and 3×3 depthwise convolution to aggregate information from spatially adjacent pixels. In addition, we employ a gating mechanism to enhance information encoding. The whole process of DGFN is defined as:
$$\hat{\mathbf{F}}=\mathrm{GELU}\left(W_{d}^{1}W_{c}^{1}\mathbf{F}^{\prime}\right)\odot W_{d}^{2}W_{c}^{2}\mathbf{F}^{\prime}+\mathbf{F}. \tag{8}$$
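
Likewise, a minimal sketch of the gated feed-forward path in Eq. (8) could look like the following; the expansion factor and the trailing $1\times1$ projection (added so the residual dimensions match when the hidden width is expanded) are our assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFeedForward(nn.Module):
    """Sketch of Eq. (8): F_hat = GELU(W_d^1 W_c^1 F') * (W_d^2 W_c^2 F') + F."""
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.branch1 = nn.Sequential(
            nn.Conv2d(channels, hidden, 1),                          # W_c^1: 1x1 conv
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),  # W_d^1: 3x3 depth-wise conv
        )
        self.branch2 = nn.Sequential(
            nn.Conv2d(channels, hidden, 1),                          # W_c^2
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),  # W_d^2
        )
        self.out = nn.Conv2d(hidden, channels, 1)                    # project back for the residual

    def forward(self, feat_mod: torch.Tensor, feat_in: torch.Tensor) -> torch.Tensor:
        gated = F.gelu(self.branch1(feat_mod)) * self.branch2(feat_mod)  # gating mechanism
        return self.out(gated) + feat_in
```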
We train CPEN-S1 and DIRformer together so that DIRformer can make full use of the IPR extracted by CPEN-S1. The training loss is defined as:

$$\mathcal{L}_{rec}=\left\|I_{GT}-\hat{I}_{HQ}\right\|_{1}, \tag{9}$$

where $I_{GT}$ and $\hat{I}_{HQ}$ are the ground-truth and restored HQ images, respectively, and $\|\cdot\|_1$ denotes the L1 norm. For tasks that emphasize visual quality, such as inpainting and SISR, we can further add perceptual and adversarial losses. More details are provided in the supplementary material.

4.2. Diffusion Model for Image Restoration

In the second stage (Fig. 2(b)), we exploit the powerful data estimation capability of DM to estimate the IPR. Specifically, we use the pretrained CPEN-S1 to capture the IPR $\mathbf{Z}\in\mathbb{R}^{4C'}$. After that, we apply the diffusion process on $\mathbf{Z}$ to sample $\mathbf{Z}_T\in\mathbb{R}^{4C'}$:

$$q\left(\mathbf{Z}_{T}\mid \mathbf{Z}\right)=\mathcal{N}\left(\mathbf{Z}_{T};\sqrt{\bar{\alpha}_{T}}\,\mathbf{Z},\left(1-\bar{\alpha}_{T}\right)\mathbf{I}\right), \tag{10}$$

where $T$ is the total number of iterations, and $\bar{\alpha}$ and $\alpha$ are defined in Eqs. (1) and (2) (i.e., $\bar{\alpha}_{T}=\prod_{i=0}^{T}\alpha_{i}$).

In the reverse process, since the IPR is compact, DiffIR-S2 can use far fewer iterations and a much smaller model size than traditional DM [45, 37] to obtain accurate estimates. Because traditional DMs incur a huge computational cost per iteration, they must randomly sample a single time step $t \in [1,T]$ and optimize the denoising network only at that time step (Eqs. (1)-(4)). Without joint training of the denoising network and the decoder, even a small estimation error from the denoising network can prevent DIRformer from reaching its potential. In contrast, DiffIR starts from the $T$-th time step (Eq. (10)), runs all denoising iterations (Eq. (11)) to obtain $\hat{\mathbf{Z}}$, and optimizes it jointly with DIRformer:

$$\hat{\mathbf{Z}}_{t-1}=\frac{1}{\sqrt{\alpha_{t}}}\left(\hat{\mathbf{Z}}_{t}-\epsilon\,\frac{1-\alpha_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\right), \tag{11}$$

where $\epsilon$ denotes the noise predicted by CPEN-S2 and the denoising network, replacing the ground-truth noise in Eq. (3). It is worth noting that, unlike traditional DM in Eq. (3), our DiffIR-S2 removes the variance term; we find that this contributes to accurate IPR estimation and better performance (Section 6).

In the reverse process of DM, we first use CPEN-S2 to obtain the conditional vector $\mathbf{D}\in\mathbb{R}^{4C'}$ from the LQ image:

$$\mathbf{D}=\mathrm{CPEN}_{\mathrm{S2}}(\mathrm{PixelUnshuffle}(I_{LQ})), \tag{12}$$

where CPEN-S2 has the same structure as CPEN-S1, except that the input dimension of the first convolution is different. Then, we use the denoising network to estimate the noise at each time step $t$ as $\epsilon_{\theta}(\mathrm{Concat}(\hat{\mathbf{Z}}_{t},t,\mathbf{D}))$. Substituting the estimated noise into Eq. (11) yields $\hat{\mathbf{Z}}_{t-1}$ for the next iteration.

Then, after $T$ iterations, we obtain the final estimated IPR $\hat{\mathbf{Z}}\in\mathbb{R}^{4C'}$. We use $\mathcal{L}_{all}$ to jointly train CPEN-S2, the denoising network, and DIRformer:

$$\mathcal{L}_{diff}=\frac{1}{4C'}\sum_{i=1}^{4C'}\left|\hat{\mathbf{Z}}(i)-\mathbf{Z}(i)\right|, \qquad \mathcal{L}_{all}=\mathcal{L}_{rec}+\mathcal{L}_{diff}, \tag{13}$$

As in Eq. (9), we can further add perceptual and adversarial losses to $\mathcal{L}_{all}$ for better visual quality.
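
To show how these pieces fit together in the second stage, the sketch below outlines one joint training step; module names are hypothetical placeholders, pre-processing such as PixelUnshuffle is omitted, and the reverse iterations are assumed to be implemented as in Eq. (11) (see the inference sketch below):

```python
import torch

def stage2_training_step(z_gt, lq, gt, cpen_s2, denoise_net, dirformer, alpha_bar, reverse_fn):
    """Joint optimization of CPEN-S2, the denoising network, and DIRformer (Eq. (13))."""
    # Eq. (10): diffuse the target IPR Z to Z_T in a single step
    noise = torch.randn_like(z_gt)
    z_T = alpha_bar[-1].sqrt() * z_gt + (1.0 - alpha_bar[-1]).sqrt() * noise

    d = cpen_s2(lq)                           # conditional vector D from the LQ image (Eq. (12))
    z_hat = reverse_fn(z_T, d, denoise_net)   # run all T reverse steps (Eq. (11))

    restored = dirformer(lq, z_hat)           # decoder guided by the estimated IPR
    l_rec = (restored - gt).abs().mean()      # Eq. (9)
    l_diff = (z_hat - z_gt).abs().mean()      # first term of Eq. (13)
    return l_rec + l_diff                     # L_all; gradients flow through all three networks
```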

In the inference stage, we only use the reverse diffusion process (Fig. 2(b), bottom). CPEN-S2 extracts the conditional vector $\mathbf{D}$ from the LQ image, and a Gaussian noise $\hat{\mathbf{Z}}_T$ is randomly sampled. The denoising network uses $\hat{\mathbf{Z}}_T$ and $\mathbf{D}$ to estimate the IPR $\hat{\mathbf{Z}}$ after $T$ iterations. Afterwards, DIRformer restores the LQ image using the estimated IPR.
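
The inference-time reverse process on the compact IPR can be sketched as below (it could also serve as the `reverse_fn` in the training sketch above, without `no_grad`); `cpen_s2` and `denoise_net` are hypothetical callables standing in for CPEN-S2 and the denoising network, and the PixelUnshuffle factor of 2 and the scalar time encoding are simplifying assumptions:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def estimate_ipr(lq_image, cpen_s2, denoise_net, betas):
    """Deterministic reverse process (Eq. (11), no variance term) on the compact IPR."""
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    T = betas.shape[0]

    d = cpen_s2(F.pixel_unshuffle(lq_image, 2))                             # Eq. (12): conditional vector D
    z = torch.randn(lq_image.shape[0], d.shape[1], device=lq_image.device)  # Z_T ~ N(0, I)

    for t in reversed(range(T)):
        t_emb = torch.full((z.shape[0], 1), float(t), device=z.device)      # simplified time-step encoding
        eps = denoise_net(torch.cat([z, t_emb, d], dim=1))                  # predict noise from (Z_t, t, D)
        z = (z - eps * (1.0 - alphas[t]) / (1.0 - alpha_bar[t]).sqrt()) / alphas[t].sqrt()  # Eq. (11)
    return z  # estimated IPR, passed to DIRformer as the dynamic modulation parameter
```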

5. Experiments

5.1. Experiment setup

We apply our method to three typical IR tasks: (a) image inpainting, (b) image super-resolution (SR), and (c) single-image motion deblurring. Our DiffIR employs a 4-level encoder-decoder structure. From level 1 to level 4, the number of DMTA attention heads is [1, 2, 4, 8] and the number of channels is [48, 96, 192, 384]. Furthermore, in all IR tasks, we tune the number of dynamic transformer blocks in DIRformer so that DiffIR and the SOTA methods are compared under similar parameter counts and computational costs. Specifically, from level 1 to level 4, we set the number of dynamic transformer blocks to [1, 1, 1, 9], [13, 1, 1, 1], and [3, 5, 6, 6] for image inpainting, SR, and deblurring, respectively. Following previous work [37, 45], we introduce adversarial and perceptual losses for image inpainting and SR. The CPEN channel number $C'$ is set to 64.

When training the diffusion model, the total number of time steps $T$ is set to 4, and $\beta_t$ ($\alpha_t = 1-\beta_t$) in Eq. (11) increases linearly from $\beta_1 = 0.1$ to $\beta_T = 0.99$. We train the model using the Adam optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.99$). More details are provided in the supplementary material.
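
For reference, the linear schedule described above (T = 4, β increasing from 0.1 to 0.99) can be written as a small helper; this is a plain reading of the stated hyperparameters, not the released training code:

```python
import torch

def make_schedule(T: int = 4, beta_start: float = 0.1, beta_end: float = 0.99):
    """Linearly spaced betas with the corresponding alphas and cumulative alpha_bar."""
    betas = torch.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)
    return betas, alphas, alpha_bar

betas, alphas, alpha_bar = make_schedule()
print(alpha_bar[-1])  # ~0.0017, so Z_T in Eq. (10) is close to pure Gaussian noise
```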

5.2. Image inpainting evaluation

We use the same settings as LaMa [52] to train and validate our DiffIR-S2 for inpainting. Specifically, we train DiffIR with a batch size of 30 and a patch size of 256 on the Places-Standard [83] and CelebA-HQ [25] datasets, respectively. We compare DiffIR-S2 with SOTA image inpainting methods (ICT [56], LaMa [52], and RePaint [37]) on the validation datasets using LPIPS [80] and FID [20].

The quantitative results are shown in Table 1 and Figure 1(a). Our DiffIR-S2 significantly outperforms the other methods. Specifically, DiffIR-S2 outperforms the competitive LaMa by FID margins of up to 0.2706 and 0.5583 with wide masks on Places and CelebA-HQ, respectively, while consuming similar parameters and Multi-Adds. Moreover, compared with the DM-based RePaint [37], our DiffIR-S2 achieves better performance while consuming only 4.3% of the parameters and 0.1% of the computational resources. This shows that DiffIR can fully and effectively utilize the data estimation ability of DM for IR.


Qualitative results are shown in Figure 3. Our DiffIR-S2 produces more realistic and reasonable structures and details than the other competing inpainting methods. More qualitative results are provided in the supplementary material.


5.3. Image super-resolution evaluation

We train and validate DiffIR-S2 on image super-resolution. Specifically, we train DiffIR-S2 on the DIV2K [1] (800 images) and Flickr2K [54] (2650 images) datasets for 4× super-resolution. The batch size is set to 64 and the LQ patch size to 64×64. We evaluate our DiffIR-S2 against other SOTA GAN-based SR methods.

Table 2 and Fig. 1(b) compare the performance and Multi-Adds of DiffIR-S2 with the SOTA GAN-based SR methods SFTGAN [59], SRGAN [32], ESRGAN [60], USRGAN [75], SPSR [39], and BebyGAN [34]. DiffIR-S2 achieves the best performance. Compared with the competitive BebyGAN, our DiffIR-S2 achieves LPIPS margins as large as 0.0121 and 0.0175 on General100 and Urban100, while consuming only 63% of the computing resources. Furthermore, it is worth noting that DiffIR-S2 significantly outperforms the DM-based LDM while consuming 2% of the computational resources.


Qualitative results are shown in Figure 4. DiffIR-S2 achieves the best visual quality and contains more realistic details. These visual comparisons are consistent with the quantitative results, showing the superiority of DiffIR in effectively using the powerful DM to restore images. More visual results are given in the supplementary material.


5.4. Image motion deblurring evaluation

We train DiffIR on the GoPro [41] dataset for image motion deblurring and evaluate it on two classic benchmarks (GoPro, HIDE [48]). We compare DiffIR-S2 with state-of-the-art image motion deblurring methods, including Restormer [69], MPRNet [70], and IPT [7].

Quantitative results (PSNR and SSIM) are shown in Table 3, and Multi-Adds are shown in Figure 1(c). Our DiffIR-S2 outperforms the other motion deblurring methods. Specifically, DiffIR-S2 outperforms IPT and MIMO-UNet+ by 0.68 dB and 0.54 dB on GoPro, respectively. In addition, DiffIR-S2 outperforms Restormer by 0.28 dB and 0.33 dB on the GoPro and HIDE datasets, respectively, while consuming only 78% of the computing resources. This demonstrates the effectiveness of DiffIR.


Qualitative results are shown in Fig. 5. Our DiffIR-S2 has the best visual quality, with realistic details closer to the corresponding HQ image. More qualitative results are provided in the supplementary material.


6. Ablation Study

Efficient Diffusion Models for Image Restoration. In this part, we verify the effectiveness of DM in DiffIR, the training scheme of DM, and whether components such as variance noise should be kept in the reverse process (Table 4).


(1) DiffIR-S2-V3 is the DiffIR-S2 used in Table 1, and DiffIR-S1 is the first-stage pre-trained network that uses ground-truth images as input. Comparing DiffIR-S1 and DiffIR-S2-V3, we can see that the LPIPS of DiffIR-S2-V3 is very close to that of DiffIR-S1, which means that DM has a powerful data modeling capability and can accurately predict the IPR.

(2) To further demonstrate the effectiveness of DM, we remove the DM from DiffIR-S2-V3 to obtain DiffIR-S2-V1. Comparing DiffIR-S2-V1 and DiffIR-S2-V3, we can see that DiffIR-S2-V3 (with DM) is significantly better than DiffIR-S2-V1. This means that the IPR learned by DM can effectively guide DIRformer to restore LQ images.

(3) To explore better DM training schemes, we compare two schemes: traditional DM optimization and our proposed joint optimization. Since traditional DM [45, 49] requires many iterations to estimate large images or feature maps, it must optimize the denoising network by randomly sampling a single time step, and thus cannot be optimized together with the subsequent decoder (i.e., DIRformer in this paper). Since DiffIR only uses DM to estimate the compact 1D IPR vector, a few iterations suffice to obtain reasonably accurate results. Therefore, we can adopt joint optimization: run all denoising iterations, obtain the IPR, and co-optimize it with DIRformer. Comparing DiffIR-S2-V2 and DiffIR-S2-V3, DiffIR-S2-V3 significantly outperforms DiffIR-S2-V2, which demonstrates the effectiveness of the proposed joint optimization for training DM. This is because even a small DM estimation error in the IPR may degrade the performance of DIRformer; jointly training DM and DIRformer solves this problem.

(4) Traditional DM methods insert variance noise (Eq. (3)) in the reverse process to generate more realistic images. Different from traditional DM, which predicts images or feature maps, we use DM to estimate the IPR. In DiffIR-S2-V4, we add this noise back during the reverse process. We can see that DiffIR-S2-V3 achieves better performance than DiffIR-S2-V4, which means that, to guarantee the accuracy of the estimated IPR, it is better to remove the added variance noise.

DM's loss function. We explore which loss function best guides the denoising network and CPEN-S2 to estimate an accurate IPR from LQ images. Here, we define three loss functions: (1) $\mathcal{L}_{diff}$ used for optimization (Eq. (13)); (2) $\mathcal{L}_{2}$ (Eq. (14)), which measures the estimation error; and (3) the Kullback-Leibler divergence $\mathcal{L}_{kl}$ (Eq. (15)), which measures distribution similarity.
$$\mathcal{L}_{2}=\frac{1}{4C^{\prime}}\sum_{i=1}^{4C^{\prime}}\left(\hat{\mathbf{Z}}(i)-\mathbf{Z}(i)\right)^{2}, \tag{14}$$

$$\mathcal{L}_{kl}=\sum_{i=1}^{4C^{\prime}}\mathbf{Z}_{norm}(i)\log\left(\frac{\mathbf{Z}_{norm}(i)}{\hat{\mathbf{Z}}_{norm}(i)}\right), \tag{15}$$

where $\hat{\mathbf{Z}}, \mathbf{Z} \in \mathbb{R}^{4C'}$ are the IPRs estimated by DiffIR-S2 and extracted by DiffIR-S1, respectively, and $\hat{\mathbf{Z}}_{norm}, \mathbf{Z}_{norm} \in \mathbb{R}^{4C'}$ are obtained from $\hat{\mathbf{Z}}$ and $\mathbf{Z}$ by a softmax operation. We apply these three loss functions separately to DiffIR-S2 to learn to estimate an accurate IPR directly from LQ images, and evaluate them on CelebA-HQ. The results are shown in Table 5: $\mathcal{L}_{diff}$ outperforms $\mathcal{L}_{2}$ and $\mathcal{L}_{kl}$.
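
The three candidate objectives are straightforward to write down; a minimal sketch, assuming `z_hat` is the IPR estimated by DiffIR-S2 and `z` is the target IPR from CPEN-S1:

```python
import torch
import torch.nn.functional as F

def ipr_losses(z_hat: torch.Tensor, z: torch.Tensor):
    """L_diff (Eq. (13)), L_2 (Eq. (14)), and L_kl (Eq. (15)) over IPR vectors of shape (B, 4C')."""
    l_diff = (z_hat - z).abs().mean(dim=-1)                       # mean absolute error
    l_2 = (z_hat - z).pow(2).mean(dim=-1)                         # mean squared error
    z_norm = F.softmax(z, dim=-1)                                 # softmax-normalized target IPR
    z_hat_norm = F.softmax(z_hat, dim=-1)                         # softmax-normalized estimated IPR
    l_kl = (z_norm * (z_norm / z_hat_norm).log()).sum(dim=-1)     # KL divergence of Eq. (15)
    return l_diff, l_2, l_kl
```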


The effect of the number of iterations. In this part, we explore how the number of DM iterations affects the performance of DiffIR-S2. We set different numbers of iterations in DiffIR-S2 and adjust $\beta_t$ ($\alpha_t = 1-\beta_t$) in Eq. (10) so that $\mathbf{Z}$ becomes Gaussian noise $\mathbf{Z}_T\sim \mathcal{N}(0,\mathbf{I})$ at the end of the diffusion process (i.e., $\bar{\alpha}_{T}\rightarrow 0$). The results are shown in Figure 6. When the number of iterations increases to 3, the performance of DiffIR-S2 improves significantly. When the number of iterations exceeds 4, DiffIR-S2 remains almost stable, i.e., it reaches its upper bound. Furthermore, our DiffIR-S2 converges in far fewer iterations than traditional DM (which requires more than 200 iterations). This is because we only perform DM on the IPR, a compact 1D vector.


7. Conclusion

Traditional DMs have achieved impressive performance in image synthesis. Unlike image synthesis, where each pixel is generated from scratch, IR provides the LQ image as a reference. Therefore, it is inefficient to directly apply traditional DMs to IR. In this paper, we propose an efficient IR diffusion model (DiffIR), which consists of CPEN, DIRformer, and a denoising network. Specifically, we first feed the ground-truth image into CPEN-S1 to generate a compact IPR that guides DIRformer. Then, we train a DM to estimate the IPR extracted by CPEN-S1. Compared with traditional DM, our DiffIR uses far fewer iterations to obtain accurate estimates and reduces artifacts in the restored image. In addition, due to the small number of iterations, our DiffIR can adopt joint optimization of CPEN-S2, DIRformer, and the denoising network to reduce the influence of estimation error. Extensive experiments show that DiffIR achieves SOTA IR performance.

Appendix

A. Appendix

B. Evaluation on Real-world SR


C. Algorithm


D. More Training Details on Inpainting

E. More Training Details on SR

F. More Training Details on deblurring

G. More Visual Comparisons on Inpainting


H. More Visual Comparisons on SR


I. More Visual Comparisons on Deblurring

