Image Deblurring: A Detailed Explanation of the MSSNet Model

This article introduces the MSSNet model for single image deblurring.

Paper: MSSNet: Multi-Scale-Stage Network for Single Image Deblurring

Code (official): https://github.com/kky7/MSSNet


1. Background

  Single-image deblurring aims to recover a sharp image from a single blurred image (the blur being caused by camera shake or by motion of the photographed objects). Image deblurring has been studied extensively over the past few decades, since blur severely degrades image quality and the performance of downstream tasks such as object detection.

  Prior to deep learning, most classical single-image deblurring methods estimated the blur kernel (which describes how the image is blurred) and the underlying sharp image by alternating optimization. To estimate blur kernels and latent sharp images efficiently and accurately, these classical methods widely adopted a coarse-to-fine scheme: first estimate a small blur kernel and sharp image at the coarsest scale, then use them as the initial solution at the next scale. The small image and blur sizes at coarse scales allow fast estimation, and the small blur size also allows more accurate estimation of blur kernels and latent images. The coarse-to-fine scheme therefore quickly provides accurate initial solutions for the next scale and improves both the quality and the efficiency of deblurring.

  With the emergence of deep learning, it was successfully applied to single image deblurring. Deep-learning-based single image deblurring models can be divided into two categories:

  • Models that follow the scheme of traditional deblurring methods: first use a CNN to estimate the blur kernel, then use the kernel to recover a sharp image.
  • Models that recover sharp images directly from blurred images in an end-to-end manner, without estimating blur kernels.

  Meanwhile, the end-to-end models that directly restore sharp images from blurred images can be further divided into two categories:

  • Multi-scale methods, such as DeepDeblur, SRN, and PSS-NSC. Owing to the effectiveness of the coarse-to-fine scheme, these models generally adopt it: they employ a multi-scale network architecture that stacks sub-networks of different scales, first estimating a small-scale latent sharp image and then using it as a guide to estimate larger-scale latent sharp images. Whether or not the blur kernel is estimated, the motivation for the coarse-to-fine scheme is the same, i.e., the deblurred image can be estimated more efficiently and accurately thanks to the small image and blur sizes at coarse scales.

  • Single-scale methods, such as DMPHN, MT-RNN, MPRNet, and HINet. Their main difference from multi-scale methods is that they accept a blurred image at only a single scale as input.

  The authors of these single-scale methods point out that previous multi-scale schemes are computationally expensive and that the coarse-scale results contribute little to the final deblurring quality. Since these single-scale methods surpass previous multi-scale methods in both quality and computation time, the traditional coarse-to-fine scheme came to seem outdated.

  Against this background, the authors revisit the coarse-to-fine scheme and analyze the flaws of previous coarse-to-fine methods that degrade model performance. To resolve these flaws, they propose MSSNet (Multi-Scale-Stage Network). Next, let's look at how the authors arrived at the MSSNet network step by step.

2. Analysis of previous coarse-to-fine methods

  Let's first look at how previous coarse-to-fine methods perform single image deblurring. Figure 1.1 shows the network architecture of these methods. Among them, SRN also has additional recurrent connections between adjacent scale sub-networks (these connections yield additional performance gains), which are not shown in the figure. Previous coarse-to-fine methods basically follow the same deblurring process, with the following detailed steps:

  1. First, an image pyramid is constructed by downsampling the input blurred image.
  2. Then, starting from the coarsest scale, a deblurred image is estimated from the downsampled blurred image, the estimated deblurred image is upsampled, and fed into the sub-network for the next scale.
  3. Finally, the next-scale sub-network uses the previous-scale deblurred image as a guide to estimate a deblurred image from the current-scale blurred image.
Figure 1.1 The network architecture of previous coarse-to-fine methods
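
To make these three steps concrete, here is a minimal PyTorch sketch of the generic pipeline; it is not any particular model's official code, and `subnets` and all names are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def coarse_to_fine_deblur(blurred, subnets, num_scales=3):
    """Generic coarse-to-fine pipeline of previous multi-scale methods.

    blurred: (N, 3, H, W) blurred image; subnets: one deblurring
    sub-network per scale, ordered coarse to fine. Each sub-network is
    assumed to take the current-scale blurred image concatenated with
    the (upsampled) previous-scale deblurred image as guidance.
    """
    # Step 1: build an image pyramid by repeated downsampling (coarsest first).
    pyramid = [F.interpolate(blurred, scale_factor=0.5 ** s,
                             mode='bilinear', align_corners=False)
               for s in reversed(range(num_scales))]

    prev = None
    for img, net in zip(pyramid, subnets):
        if prev is None:
            guide = img  # Step 2: no coarser result exists at the coarsest scale.
        else:
            # Steps 2-3: upsample the coarser deblurred image and use it as a guide.
            guide = F.interpolate(prev, size=img.shape[-2:],
                                  mode='bilinear', align_corners=False)
        prev = net(torch.cat([img, guide], dim=1))
    return prev  # finest-scale deblurred image
```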

  By analyzing the network architectures used in these methods, the author found that these network architectures have the following three defects:

  • Network architectures that ignore the blur scale. When deblurring an image, restoring the value of a specific pixel requires a receptive field larger than the blur size. Larger blurs therefore require larger receptive fields, i.e., deeper networks; likewise, the finer scales in a coarse-to-fine method require deeper sub-networks. Yet previous coarse-to-fine models use the same network architecture at every scale.
  • Inefficient information propagation across scales. Previous coarse-to-fine methods transfer the pixel values of the deblurred image from a coarse scale to the next. This loses much of the rich information encoded in the coarse-scale feature vectors and ultimately degrades deblurring performance.
  • Information loss due to downsampling. When generating multi-scale input blurred images, previous methods construct image pyramids by repeatedly downsampling the input blurred image, and downsampling causes severe information loss.

3. Model Design

3.1 Network Architecture

  Based on the preceding analysis of previous coarse-to-fine methods, the authors propose MSSNet, a new deep-learning-based single image deblurring method that adopts a coarse-to-fine scheme while addressing those shortcomings. Figure 1.2 shows the network architecture of MSSNet. Like previous coarse-to-fine methods, MSSNet consists of three scales, denoted from coarse to fine as $S_1$, $S_2$, and $S_3$. MSSNet predicts a residual image $R$, which is added to the blurred image $B$ to obtain the deblurred image $L = B + R$.

Figure 1.2 The network architecture of MSSNet

  To address the shortcomings of previous coarse-to-fine approaches analyzed earlier, MSSNet employs three strategies:

  • A stage configuration reflecting blur scales.
  • A cross-scale information propagation scheme.
  • A multi-scale scheme based on Pixel-Shuffle.

  Next, we take a closer look at the specific implementation of each strategy.

3.1.1 Stage Configuration Reflecting Blur Scales

  To reflect the blur scale, the finer-scale sub-networks of MSSNet have deeper network architectures. The specific implementation is as follows:

  • The scales $S_1$, $S_2$, and $S_3$ of MSSNet contain 1, 2, and 3 stage networks respectively, and each stage network consists of a single lightweight UNet module. We denote each stage network by $U_i^j$, where $i$ and $j$ are the scale and stage indices, respectively.
  • All UNet modules share the same network architecture but have different weights.
  • Each UNet module produces residual features, which can be converted into a residual image; adding the residual image to the blurred image yields a deblurred image.
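
As a rough sketch of this stage configuration (illustrative PyTorch; `LightUNet` is a hypothetical stand-in for the paper's lightweight UNet module, and the channel width is an assumption):

```python
import torch.nn as nn

class LightUNet(nn.Module):
    """Placeholder for the paper's lightweight UNet stage module."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1))

    def forward(self, feat):
        return self.body(feat)  # residual features

ch = 54  # assumed feature width (first channel setting of the standard variant, Section 3.3)
# 1, 2, and 3 stages for S1, S2, and S3: the finer the scale, the deeper
# the sub-network. The modules share an architecture, but each instance
# has its own weights.
stages = nn.ModuleDict({
    'S1': nn.ModuleList([LightUNet(ch)]),
    'S2': nn.ModuleList([LightUNet(ch) for _ in range(2)]),
    'S3': nn.ModuleList([LightUNet(ch) for _ in range(3)]),
})
```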

3.1.2 A cross-scale information propagation scheme

  Previous multi-scale networks transfer the upsampled deblurred image from the coarse scale to the next finer scale, whereas MSSNet transfers upsampled residual features, enabling efficient information propagation between scales. The specific implementation is as follows:

  • First, bilinear upsampling and a 1 x 1 convolution are applied sequentially to the residual features output by the coarse-scale network.
  • Then, the result is concatenated with features extracted from the fine-scale blurred image, and a 3 x 3 convolution is applied to obtain fused features.
  • Finally, the fused features are fed into the subsequent UNet stages.
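
These three steps can be sketched as a small PyTorch module; the class name and channel width are our own illustrative assumptions, not the official implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossScaleFusion(nn.Module):
    """Sketch of MSSNet's cross-scale residual-feature propagation."""
    def __init__(self, ch):
        super().__init__()
        self.proj = nn.Conv2d(ch, ch, kernel_size=1)      # the 1 x 1 conv
        self.fuse = nn.Conv2d(2 * ch, ch, 3, padding=1)   # the 3 x 3 conv

    def forward(self, coarse_feat, fine_feat):
        # Bilinear upsampling followed by a 1 x 1 convolution, applied to
        # the residual features coming from the coarser scale.
        up = self.proj(F.interpolate(coarse_feat, scale_factor=2,
                                     mode='bilinear', align_corners=False))
        # Concatenate with features extracted from the fine-scale blurred
        # image, then fuse with a 3 x 3 convolution.
        return self.fuse(torch.cat([up, fine_feat], dim=1))
```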

3.1.3 A multi-scale scheme based on Pixel-Shuffle

  When generating the multi-scale input blurred images, the authors propose a Pixel-Shuffle-based multi-scale scheme to avoid the information loss caused by downsampling. Let the blurred image $B$ have size $W \times H \times 3$; downsampling $B$ yields $B_2$ of size $W/2 \times H/2 \times 3$. The specific implementation is as follows:

  • For the finest scale $S_3$, the blurred image $B_3 = B$ is used as input.
  • For $S_2$, an unshuffle operation is applied to the blurred image $B$, yielding four images of size $W/2 \times H/2 \times 3$. These four images are stacked along the channel dimension, producing a tensor $X_2$ of size $W/2 \times H/2 \times 12$, which serves as the input to $S_2$. Note that $X_2$ and $B_2$ have the same spatial size (i.e., $W/2 \times H/2$), but $X_2$ contains the same information as $B_3$. Because $X_2$ carries this richer information, $S_2$ can produce more accurate results.
  • For the coarsest scale $S_1$, the same unshuffle operation is applied to $B_2$, producing a tensor $X_1$ of size $W/4 \times H/4 \times 12$, which serves as the input to $S_1$. One might wonder why the unshuffle operation is not applied directly to $B$ to obtain a tensor of size $W/4 \times H/4 \times 48$ as input. The authors tested this scheme as well, but found that it causes a slight performance loss.
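
PyTorch's built-in pixel-unshuffle makes this input scheme easy to reproduce; a minimal sketch, assuming a 256 x 256 input:

```python
import torch
import torch.nn.functional as F

B = torch.randn(1, 3, 256, 256)                 # blurred image B (W = H = 256)

B3 = B                                          # S3 input: (1, 3, 256, 256)

# S2 input: unshuffle B once; the four W/2 x H/2 sub-images are stacked
# along the channel dimension.
X2 = F.pixel_unshuffle(B, downscale_factor=2)   # (1, 12, 128, 128)

# S1 input: downsample B to B2, then unshuffle B2.
B2 = F.interpolate(B, scale_factor=0.5, mode='bilinear', align_corners=False)
X1 = F.pixel_unshuffle(B2, downscale_factor=2)  # (1, 12, 64, 64)

# The rejected alternative mentioned above: unshuffling B directly by a
# factor of 4 gives a 48-channel tensor, which performed slightly worse.
X1_alt = F.pixel_unshuffle(B, downscale_factor=4)  # (1, 48, 64, 64)
```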

3.1.4 Feature fusion across stages and scales

  MSSNet adopts a cross-stage feature fusion scheme. Specifically, it provides additional connections between adjacent stages (pink dashed lines in Figure 1.2) to enable more efficient information propagation between adjacent stages. Figure 1.3(a) shows the cross-stage feature fusion scheme.

  In addition, MSSNet also adopts a cross-scale feature fusion scheme. Likewise, it provides additional connections between adjacent scales (green dashed lines in Figure 1.2) to enable more efficient information propagation between adjacent scales. Figure 1.3(b) shows the cross-scale feature fusion scheme.

Figure 1.3 Cross-stage and cross-scale feature fusion scheme

Summary: Each of these strategies is simple and clear, so MSSNet ends up being a simple network architecture. Despite this simplicity, the authors show experimentally that MSSNet achieved the state of the art of its time in deblurring quality, network size, and computation time.

3.2 Training and loss function

  During training, an auxiliary layer generates one deblurred image for each stage of MSSNet, i.e., a total of six (1+2+3=6) deblurred images. Note that at inference time, only $U_3^3$ in Figure 1.2 is used to generate the deblurred image. The specific implementation is as follows:

  • For $S_3$, a 3 x 3 convolutional layer is attached after $U_3^j$ to generate a residual image $R_3^j$ (of size $W \times H \times 3$), which is added to $B_3$ to obtain the deblurred image $L_3^j$.
  • For $S_2$, a 3 x 3 convolutional layer and a Pixel-Shuffle layer are attached after $U_2^j$ to generate a residual image (of size $W \times H \times 3$), which is added to $B_3$ to obtain the deblurred image $L_2^j$.
  • For $S_1$, a 3 x 3 convolutional layer and a Pixel-Shuffle layer are attached after $U_1^j$ to generate a residual image (of size $W/2 \times H/2 \times 3$), which is added to $B_2$ to obtain the deblurred image $L_1^j$.
Figure 1.4 Training of MSSNet. During training, each stage generates residual images using auxiliary convolutional and pixel-shuffle layers.
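
Under the same hypothetical channel width as the earlier sketches, the auxiliary heads might look as follows (an illustrative sketch, not the official code):

```python
import torch
import torch.nn as nn

ch = 54  # assumed width of the per-stage residual features

# S3: a 3 x 3 conv maps residual features to a W x H x 3 residual image.
head_s3 = nn.Conv2d(ch, 3, 3, padding=1)

# S2 and S1: a 3 x 3 conv followed by pixel-shuffle, so the residual
# image is upsampled 2x relative to the stage's working resolution.
head_s2 = nn.Sequential(nn.Conv2d(ch, 12, 3, padding=1), nn.PixelShuffle(2))
head_s1 = nn.Sequential(nn.Conv2d(ch, 12, 3, padding=1), nn.PixelShuffle(2))

feat = torch.randn(1, ch, 128, 128)  # e.g. residual features of an S2 stage
R = head_s2(feat)                    # (1, 3, 256, 256) residual image
# The stage's deblurred image is then L = B3 + R (or B2 + R for S1).
```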

3.2.1 Loss function

  The loss function used to train MSSNet consists of a content loss $\mathcal{L}_{cont}$ and a frequency reconstruction loss $\mathcal{L}_{freq}$, as shown in Equation (1.1):

$$\mathcal{L}_{total} = \mathcal{L}_{cont} + \lambda \mathcal{L}_{freq} \tag{1.1}$$

where $\lambda = 0.1$.

  The content loss is an L1 loss, as shown in Equation (1.2):

$$\mathcal{L}_{cont} = \frac{1}{N_1}\|L_1^1 - L_{gt\downarrow}\|_1 + \sum_{j=1}^{2}\frac{1}{N_2}\|L_2^j - L_{gt}\|_1 + \sum_{j=1}^{3}\frac{1}{N_3}\|L_3^j - L_{gt}\|_1 \tag{1.2}$$

where $L_{gt}$ is the ground-truth sharp image, $L_{gt\downarrow}$ is the downsampled version of $L_{gt}$, and $L_i^j$ is the deblurred image generated at each stage; $N_1$, $N_2$, and $N_3$ are normalization factors, with $N_1 = W/2 \times H/2 \times 3$ and $N_2 = N_3 = W \times H \times 3$.

  The frequency reconstruction loss is used to recover high-frequency details from blurred images by minimizing the difference between the deblurred image and the ground-truth image in the frequency domain, as shown in Equation (1.3):

$$\begin{aligned} \mathcal{L}_{freq} = {}& \frac{1}{N_1}\|\mathcal{F}(L_1^1) - \mathcal{F}(L_{gt\downarrow})\|_1 + \sum_{j=1}^{2}\frac{1}{N_2}\|\mathcal{F}(L_2^j) - \mathcal{F}(L_{gt})\|_1 \\ &+ \sum_{j=1}^{3}\frac{1}{N_3}\|\mathcal{F}(L_3^j) - \mathcal{F}(L_{gt})\|_1 \end{aligned} \tag{1.3}$$

where $\mathcal{F}$ denotes the Fourier transform.
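
A compact PyTorch sketch of the combined loss (the per-stage output layout is our own assumption, and `.mean()` stands in for the explicit $1/N_i$ normalization):

```python
import torch

def content_loss(pred, target):
    # L1 distance, averaged over all elements (plays the role of 1/N_i).
    return (pred - target).abs().mean()

def freq_loss(pred, target):
    # L1 distance between 2D Fourier spectra; .abs() is the complex modulus.
    return (torch.fft.fft2(pred) - torch.fft.fft2(target)).abs().mean()

def mssnet_loss(outputs, L_gt, L_gt_down, lam=0.1):
    """outputs: the six per-stage deblurred images, e.g.
    {'S1': [L1_1], 'S2': [L2_1, L2_2], 'S3': [L3_1, L3_2, L3_3]}."""
    total = 0.0
    for scale, preds in outputs.items():
        target = L_gt_down if scale == 'S1' else L_gt  # S1 works at half size
        for pred in preds:
            total = total + content_loss(pred, target) \
                          + lam * freq_loss(pred, target)
    return total
```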

3.3 Model variants

  The authors provide three variants of MSSNet, which differ mainly in the channel settings of the UNet modules: MSSNet-small uses 20, 60, and 100; MSSNet uses 54, 96, and 138; and MSSNet-large uses 80, 130, and 180.
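
For quick reference, these settings can be written as a small configuration table (the numbers are from the paper; how they map onto the UNet's internal levels is our assumption):

```python
# Channel settings of the three MSSNet variants.
MSSNET_VARIANTS = {
    'MSSNet-small': (20, 60, 100),
    'MSSNet':       (54, 96, 138),
    'MSSNet-large': (80, 130, 180),
}
```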

