A review paper on image restoration methods based on deep learning

This article is reprinted from: [CV] Review paper on deep learning methods for image restoration (2022)

Original link: [CV] A survey of deep learning approaches to image restoration (datamonday's blog): https://blog.csdn.net/weixin_39653948/article/details/124455382

Paper name: A survey of deep learning approaches to image restoration
Paper download: https://www.sciencedirect.com/science/article/pii/S0925231222002089?via%3Dihub
Paper year: 2022
Paper cited: 2022/04/27


Abstract

In this paper, we present an extensive review on deep learning methods for image restoration tasks. Deep learning techniques, led by convolutional neural networks, have received a great deal of attention in almost all areas of image processing, especially in image classification. However, image restoration is a fundamental and challenging topic and plays significant roles in image processing, understanding and representation. It typically addresses image deblurring, denoising, dehazing and super-resolution. There are substantial differences in the approaches and mechanisms in deep learning methods for image restoration. Discriminative learning based methods are able to deal with issues of learning a restoration mapping function effectively, while optimisation models based methods can further enhance the performance with certain learning constraints. In this paper, we offer a comparative study of deep learning techniques in image denoising, deblurring, dehazing, and super-resolution, and summarise the principles involved in these tasks from various supervised deep network architectures, residual or skip connection and receptive field to unsupervised autoencoder mechanisms. Image quality criteria are also reviewed and their roles in image restoration are assessed. Based on our analysis, we further present an efficient network for deblurring and a couple of multi-objective training functions for super-resolution restoration tasks. The proposed methods are compared extensively with the state-of-the-art methods with both quantitative and qualitative analyses. Finally, we point out potential challenges and directions for future research.

【Significance】

In this paper, we provide an extensive review of deep learning methods for image restoration tasks. Deep learning technology, headed by convolutional neural networks, has received widespread attention in almost all image processing fields, especially in the field of image classification. However, image restoration is a fundamental and challenging topic that plays an important role in image processing, understanding, and representation.

【Research Directions in Image Restoration】

It typically addresses image deblurring, denoising, dehazing and super-resolution.

【Image Restoration Methods】

Deep learning methods for image restoration vary widely in their methods and mechanisms.

  • Methods based on discriminative learning can effectively handle the problem of learning recovery mapping functions.

  • Methods based on optimization models can further improve performance under certain learning constraints.

【Work of this article】

In this paper, we conduct a comparative study of deep learning techniques in image denoising, deblurring, dehazing, and super-resolution:

  • The principles involved in these tasks are summarized, from various supervised deep network architectures, residual or skip connections, and receptive fields to unsupervised autoencoder mechanisms.
  • Image quality criteria are investigated and their role in image restoration is evaluated.
  • Based on our analysis, we further propose an efficient deblurring network and several multi-objective training functions for the super-resolution restoration task.

【Research result】

The proposed methods are compared extensively with state-of-the-art methods through both quantitative and qualitative analyses. Finally, we point out potential challenges and directions for future research.

1. Introduction

Image restoration has been a long-term research topic in digital image processing since the last century [1-5] and has remained an active topic in recent years. Image restoration aims to recover clean latent images from degraded observations and is a typical inverse problem. The infinitely many possible mappings between multidimensional degraded observations and restored images determine the ill-posed nature of this inverse problem. When the mapping is known and invertible, the corresponding solution is easy to obtain, but such a mapping is unique to each case and lacks generality. In practice, the inverse mapping is unknown, so the solution space is infinite, and regularization techniques need to be applied to arrive at a feasible optimal solution. Therefore, most image restoration research focuses on employing efficient analytical models and learning schemes to find accurate mapping approximations for restoring degraded images.

Traditional image restoration methods use advanced mathematics and probability models to solve inverse problems, mainly based on maximum likelihood or Bayesian methods in iterative algorithms [6-8].
Assuming that, in general, the degraded image Y is the result of convolving the clean image X with a blur kernel K plus additive noise N:

Y = K ⊗ X + N    (1)

This model underpins a wide range of applications, such as super-resolution [12-14], restoration [15-17], astronomy [18-20], medicine [21-23], microscopy [24-26], etc. There is also growing interest in multi-frame and video restoration, which utilizes the relationships between consecutive image frames to reconstruct high-quality clean images and videos [27-32].
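As a concrete illustration, the sketch below simulates this degradation model with NumPy/SciPy; the Gaussian kernel shape and noise level are illustrative assumptions, not values taken from the paper.

```python
# A minimal sketch of the degradation model Y = K (*) X + N in Eq. (1):
# convolve a clean image with a blur kernel, then add Gaussian noise.
import numpy as np
from scipy.signal import convolve2d

def gaussian_kernel(size=9, sigma=2.0):
    """Isotropic Gaussian blur kernel K, normalized to sum to 1."""
    ax = np.arange(size) - size // 2
    xx, yy = np.meshgrid(ax, ax)
    k = np.exp(-(xx**2 + yy**2) / (2.0 * sigma**2))
    return k / k.sum()

def degrade(x, kernel, noise_sigma=0.01):
    """Y = K (*) X + N for a single-channel image x with values in [0, 1]."""
    blurred = convolve2d(x, kernel, mode="same", boundary="symm")  # K (*) X
    noise = np.random.normal(0.0, noise_sigma, x.shape)            # N
    return np.clip(blurred + noise, 0.0, 1.0)

x = np.random.rand(64, 64)          # stand-in for a clean image X
y = degrade(x, gaussian_kernel())   # degraded observation Y
```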

In the past decade, the rapid rise of deep learning (DL) technology has greatly impacted various computer vision tasks, from recognition and classification [38-41] to regression and generation [42-45]. Convolutional neural networks (CNNs) first improved the performance of classification and detection [46], and many network architectures have been proposed to solve benchmark research tasks.

  • VGGNet [47] pointed out that deep network architecture is beneficial, while previous research mainly focused on shallow networks [48].
  • ResNet [39] provides the baseline structure for image restoration and has become the basic structure of several methods, such as
    • EDSR [49] (for super-resolution)
    • DeepDeblur[50] (for image deblurring)
    • DnCNN [35] (for image denoising).
  • DenseNet [51] further improves network performance by developing residual links with densely connected convolutional layers.

Deep learning methods bring many benefits to image restoration, such as:

  • Learning-based methods can improve performance . Deep learning-based methods often significantly outperform traditional methods on most benchmark datasets.
  • Deep learning makes more applications practical. One can restore a degraded video by considering sequential frames or fill in missing content, even when the degradation process is impossible to model mathematically (e.g., inpainting).
  • By using parallel processing units such as graphics processing units (GPUs), deep learning algorithms map naturally onto modern computer hardware and are more efficient than running on CPUs.

However, many challenges remain:

  • From a computational complexity perspective, deep learning-based methods have considerable computational costs, making them difficult to deploy for real-time processing. In addition, heavy matrix processing places high demands on computer hardware: in terms of GPU power and memory, embedded systems commonly used in industry, such as microcontroller units (MCUs), cannot meet the requirements.
  • From a performance perspective, there is still a lot of room for improvement in existing algorithms.
  • From a training perspective, deep learning CNNs require large datasets, which are difficult to obtain and label, and may not match the actual application. For example, many deblurring or super-resolution applications focus on faces, but most existing training datasets contain relatively few face samples, while many other samples, such as cars or buildings, may not be helpful for a specific application.

There are also some tasks that are highly related to image restoration, such as 3D reconstruction [52] and image inpainting [53]. Ideas and new methods in image restoration can benefit these tasks and vice versa.

This survey aims to provide a timely update and overview of deep learning methods for image restoration and is organized as follows.

  • Section 2 provides a general review of existing deep neural networks for image restoration, followed by a detailed review of models for deblurring, denoising, and super-resolution tasks. Various image quality assessment criteria are also reviewed and discussed.
  • Section 3 reviews and analyzes typical network architectures and learning strategies. The latest models are briefly considered. We then present some networks for deblurring and super-resolution tasks, along with extensive experiments and comparisons with state-of-the-art models.
  • The final section discusses these networks, performance and results, as well as remaining challenges and concludes the work. Future work and research directions are also proposed.

2. Deep Networks for Image Restoration

2.1. Image Restoration

There are several ways to apply deep learning to digital image restoration. Learning image priors or kernels through deep neural networks [54-56,17] is a popular approach. Compared with complex hand-crafted image priors and the extensive work required to derive them, learning priors via deep neural networks as efficient regularizers for ill-posed problems is more efficient. The learned priors are integrated into a subsequent optimization stage to recover the degraded images, helping achieve performance superior to priors based on analytical models. Furthermore, deep learning methods employ various architectures [59,60,56,61] and learning strategies [62,63] to obtain better solutions, leveraging powerful learning capabilities to extract important information from massive training data. Extensive research has been conducted to apply popular deep learning techniques to image restoration tasks.

Recently, generative adversarial network (GAN)-based methods have become dominant and surpassed general CNN-based methods, improving state-of-the-art performance [64-66]. The strong compatibility and capacity of GAN models alleviate the burden of designing a network for a specific application, but at the expense of larger and deeper networks and training problems [67-69]. Furthermore, these advanced networks have achieved significant progress in a variety of applications, including underwater imaging [63,70], light field imaging [60], fluorescence image reconstruction [71] and computed tomography (CT) super-resolution [72].

2.2. Image Deblurring

Blurry images are common in practice due to various factors, such as unavoidable motion during long exposure times, physical limitations and imperfections of imaging equipment, and unknown degradation processes, all of which make recovery intractable. Researchers have put great effort into developing effective and novel methods to solve these challenging problems.

Dynamic scene blur is ubiquitous in real-life image capture. Blur can be caused by a mixture of camera motion, object motion, and changes in scene depth. Camera motion has six degrees of freedom in two categories: translational and rotational. Translational motion is related to depth changes [73,74], while rotational camera motion and object motion are independent factors that can also cause non-uniform blur in images. Since these motion blurs vary spatially, modeling the imaging and degradation processes is not a trivial task, especially when only a single blurred image is available. Many attempts have helped build models that approximate the real blur kernel by using prior knowledge and additional observations of the image.

Some studies have reviewed representative works and compared the performance of single-image deblurring methods:

  • Wang et al. [75] reviewed traditional image deblurring methods, defined the common types of blur appearing in imaging, and classified the methods into five main frameworks based on their respective characteristics. Since learning-based methods were not well developed at the time, neural networks were only considered a promising topic for further research.
  • Lai et al. [76] evaluated and compared 13 single-image deblurring algorithms using their own real-world blurred images and a human subjects study (Amazon Mechanical Turk).
  • The recent NTIRE (New Trends in Image Restoration and Enhancement) 2020 Image and Video Deblurring Challenge introduced state-of-the-art methods and provided fair rankings and performance comparisons [77].
  • A recent survey by Koh et al. [78] reviews the development of deep learning-based non-blind and blind deblurring techniques since 2013. Their comparative study illustrates the artifacts caused by perceptual losses, the superiority of explicit image priors, and the potential of unsupervised learning.

Multi-scale deblurring networks use a "coarse-to-fine" structure to restore images in several steps.

  • It was first proposed in [50], which applies the multi-scale structure developed by Eigen et al. [79].
  • Tao et al. [80] and Gao et al. [81] developed the multi-scale deblurring network further, while Zhang et al. [82] made fundamental changes to the structure and mechanism when adopting it, so the four methods differ significantly.

DeblurGAN [33] was the first to apply a conditional GAN to deblurring problems.

  • This method uses a residual network block [39] as the main component of the generator.

  • DeblurGAN-v2 [64] is an updated version of DeblurGAN, using the Feature Pyramid Network (FPN), originally proposed for object detection [83,84], as the generator.

  • The authors of [85] proposed an end-to-end deblurring network based on unsupervised CNNs. Supervised deep learning networks rely extensively on large amounts of paired data, which is demanding and challenging to obtain, while unsupervised training schemes can achieve comparable performance with unpaired data.

  • [86] proposed another unsupervised network, based on disentangled representations, for domain-specific single-image deblurring.

The authors of [87] proposed a new type of network called Dr-Net. They used the Douglas-Rachford iteration to solve the deblurring problem because it is a more suitable optimization procedure than the proximal gradient descent algorithm. [88] reported that recovering images affected by severe blur requires network designs with large receptive fields, and proposed a new architecture composed of region-adaptive dense deformable modules that implicitly discover the spatially varying shifts underlying non-uniform blur in the input image and learn modulation filters.

Table 1, Table 2 and Table 3 give a comparison of the various methods.
[Tables 1-3: comparison of deblurring methods]

2.3. Image Denoising

Image denoising is another important task in image restoration and has extraordinary value for low-level vision in many respects. First, noise removal is often an essential pre-processing step in various computer vision tasks. Second, image denoising is an ideal test bed for evaluating image prior models and optimization methods from a Bayesian perspective [94]. Traditionally, BM3D [91] has been the mainstream method, enhancing sparsity by grouping similar 2-D image fragments (e.g., patches) into 3-D data arrays. Learning-based denoising focuses not only on deep learning but also on other machine learning methods, since noise mechanisms are widely applicable across signal processing. In this paper, we focus on DL-based denoising and its commonalities with other image restoration tasks such as dehazing and deblurring. For an overview of learning-based image denoising, see [109]. Mathematically, the noisy image Y can be expressed as
Y = X + N
where X represents the clean image and N the additive noise that corrupts X. Noise can also be multiplicative in nature. Deep CNNs began to be applied to image denoising in 2015 [110,111]. The first important work is [112], which first applied a very deep CNN with skip connections. [93] developed a Monte Carlo denoising method with a kernel-splatting architecture.

According to the type of noise, image denoising can be divided into four categories:

  • Additive white noise image (AWNI) denoising
  • Real noise image denoising
  • Blind denoising
  • Mixed image denoising

Of these categories, AWNI has received the most attention. However, additive white noise does not reflect the noise found in real images. Therefore, although AWNI denoising covers Gaussian, Poisson, salt-and-pepper and multiplicative noise, a gap remains between it and practical application scenarios.
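For illustration, the sketch below synthesizes the three most common AWNI types named above; the parameter values are assumptions chosen for demonstration, not taken from the paper.

```python
# Illustrative generators for Gaussian, Poisson and salt-and-pepper noise
# on an image x with values in [0, 1].
import numpy as np

def add_gaussian(x, sigma=0.05):
    """Signal-independent additive white Gaussian noise."""
    return np.clip(x + np.random.normal(0.0, sigma, x.shape), 0, 1)

def add_poisson(x, peak=255.0):
    """Poisson (shot) noise: signal-dependent photon-count statistics."""
    return np.clip(np.random.poisson(x * peak) / peak, 0, 1)

def add_salt_and_pepper(x, amount=0.02):
    """A fraction `amount` of pixels is forced to pure black or white."""
    y = x.copy()
    mask = np.random.rand(*x.shape)
    y[mask < amount / 2] = 0.0       # pepper
    y[mask > 1 - amount / 2] = 1.0   # salt
    return y
```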

Relevant details can be found in a recent overview [109]. In this subsection, we aim to compare learning-based denoising methods with other image restoration tasks. Many ideas and techniques developed for denoising are also applicable to other image inverse problems and vice versa, and many important denoising networks were inspired by existing low-level vision work. For example, DnCNN [35] first introduced residual learning to image restoration. The residual learning here is different from that of ResNet [39]: it adopts a single residual unit to predict the residual image. In other words, DnCNN uses a long residual link to connect the input image directly to the output, so that the network only needs to learn the residual image and does not have to reproduce the image content. Residual learning has had a great impact on image restoration, and since DnCNN most deblurring and super-resolution networks use residual links. A comprehensive comparison of image denoising methods is shown in Table 4.
[Table 4: comparison of image denoising methods]
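To make the residual-learning idea concrete, below is a minimal Keras sketch in the spirit of DnCNN [35]: a stack of convolutions predicts the residual (noise) image, and a long skip connection subtracts it from the input so the network never has to reproduce image content. The depth and width are illustrative assumptions, not the exact DnCNN configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dncnn_like(depth=17, width=64):
    noisy = layers.Input(shape=(None, None, 1))
    x = layers.Conv2D(width, 3, padding="same", activation="relu")(noisy)
    for _ in range(depth - 2):
        x = layers.Conv2D(width, 3, padding="same", use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.ReLU()(x)
    residual = layers.Conv2D(1, 3, padding="same")(x)  # predicted noise image
    clean = layers.Subtract()([noisy, residual])       # long residual link
    return tf.keras.Model(noisy, clean)

model = dncnn_like()
model.compile(optimizer="adam", loss="mse")  # train against clean targets
```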

2.4. Image Dehazing

The atmospheric scattering model is a classic description of hazy image generation:
Y(x) = X(x) · t(x) + A · (1 − t(x))
where Y is the observed hazy image and X is the haze-free scene radiance to be recovered. There are two key parameters: A represents the global atmospheric light, and t is the transmission matrix, defined as:
t(x) = e^(−b·d(x))
where b is the scattering coefficient of the atmosphere and d is the distance between the object and the camera.
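A small sketch of synthesizing a hazy image from this model follows; the toy depth map and the values of A and b are illustrative assumptions.

```python
# Hazy image synthesis from the atmospheric scattering model:
# Y = X * t + A * (1 - t), with transmission t = exp(-b * d).
import numpy as np

def synthesize_haze(x, depth, A=0.9, b=1.0):
    """x: clean RGB image in [0,1] (HxWx3); depth: HxW distance map d."""
    t = np.exp(-b * depth)            # transmission map
    t = t[..., np.newaxis]            # broadcast over color channels
    return np.clip(x * t + A * (1.0 - t), 0.0, 1.0)

x = np.random.rand(64, 64, 3)                          # stand-in clean scene
depth = np.tile(np.linspace(0.5, 3.0, 64), (64, 1))    # toy depth ramp
hazy = synthesize_haze(x, depth)
```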

Due to turbid media such as fog and dust, haze results in poor visibility and adds data-dependent, complex and non-linear noise to the image, making dehazing an ill-posed and extremely challenging restoration problem. Many computer vision algorithms only work well on haze-free scene radiance.

  • [113] reconstructed an image formation model that considers surface shading in addition to the transmission function.
  • [114] proposed the dark channel prior (DCP) to remove haze from a single image, based on the observation that patches of outdoor haze-free images usually contain pixels with very low intensity values.
  • [115] is an early learning-based approach.
  • [116] proposed an integrated method to directly generate clean images through lightweight CNN based on a reformulated atmospheric scattering model.
  • [117] introduced GANs for dehazing by using a discriminator to guide a generator to create pseudo-realistic images at coarse scales, while an enhancer following the generator is required to produce realistic dehazed images at fine scales.
  • [118] adopted a smoothed dilation technique and utilized gated subnetworks to fuse features at different levels.
  • MSRL-DehazeNet [119] relies on multi-scale residual learning and image decomposition.
  • RYF-Net [120] uses a transmission-map fusion network to integrate two transmission maps and estimate a robust and accurate scene transmission map for hazy images.
  • [121] contains a supervised learning branch and an unsupervised learning branch.
  • DCP-Loss [122] uses dark channel prior as the loss function.
  • The authors of [123] proposed a method based on heterogeneous GANs, consisting of a CycleGAN for generating clear images and a conditional GAN for preserving texture details.
  • Similar work can be seen in Cycle-dehaze [124].
  • FAMED-Net [125] includes three scale encoders and a fusion module to learn haze-free images efficiently and directly.
  • [126] proposed a domain adaptation paradigm consisting of an image translation module and two image dehazing modules.
  • The authors of [127] adopted a novel fusion-based strategy that derives three inputs from the original hazy image by applying white balance (WB), contrast enhancement (CE), and gamma correction (GC).
  • Similar to many image deblurring networks, DCPDN [128] adopts a densely connected structure.

Table 5 gives a comparison of learning-based image dehazing methods; see Fig. 1.
[Table 5 and Figure 1 omitted]

2.5. Image Super-resolution

Super-resolution (SR) is a technique for reconstructing high-resolution images that effectively overcomes the inherent limitations of imaging systems [134]. It has attracted widespread attention due to its practical value in a wide range of applications. In the early stages of super-resolution development, the availability of multiple low-resolution (LR) images was considered a basic prerequisite, along with restoration and interpolation techniques, which together contribute to obtaining high-resolution (HR) images. When only one LR image is available, the problem becomes more challenging and is called single image super-resolution (SISR). Unlike other restoration tasks, SR requires an additional upsampling process after deblurring to increase the image dimensions and obtain HR images. Based on Equation 1, the degradation applies the downsampling operator D after blurring, as shown in Equation 7. The observation model is shown in Figure 2.
Y = D(K ⊗ X) + N    (7)
[Figure 2: the super-resolution observation model]
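A small sketch of this observation model, as used to generate LR/HR training pairs, is shown below; the kernel width, scale factor and noise level are illustrative assumptions.

```python
# SR observation model of Eq. (7): blur the HR image (K (*) X), apply the
# downsampling operator D (stride-s subsampling here), then add noise N.
import numpy as np
from scipy.ndimage import gaussian_filter

def generate_lr(hr, scale=4, blur_sigma=1.2, noise_sigma=0.01):
    blurred = gaussian_filter(hr, sigma=blur_sigma)            # K (*) X
    lr = blurred[::scale, ::scale]                             # D(.)
    lr = lr + np.random.normal(0.0, noise_sigma, lr.shape)     # + N
    return np.clip(lr, 0.0, 1.0)

hr = np.random.rand(256, 256)   # stand-in HR image X
lr = generate_lr(hr)            # 64 x 64 LR observation Y
```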

2.6. Image Quality Assessments

Image quality assessment (IQA) is critical for evaluating image quality, image processing algorithms, and imaging systems. Only with uniform quality measures can fair comparisons be made that reflect the characteristics and properties of algorithms and systems with convincing and reliable evidence. Initially, image quality measurement (IQM) was mainly used to evaluate image compression and acquisition technologies, and was later generalized to other image processing tasks and image communication networks [153]. Since the ultimate recipient of an image is a human, the most reliable assessment of visual quality is a subjective human study that collects ratings over a large number of test examples. However, conducting such studies for every case in practice is time-consuming and often too expensive. Therefore, there is a great need for objective IQA that effectively predicts perceptual quality while correlating with human visual system (HVS) responses.

The most common classification of objective quality measures is based on the availability of reference images, i.e.:

  • Full-reference (FR) quality measures: calculate the similarity between the distorted image and the reference image.
  • Reduced-reference (RR) quality measures: applied when partial information from the reference image is available.
  • No-reference (NR) quality measures: use image statistics to evaluate image quality, since information from the reference image is completely unavailable.

The simplest objective FR measure is the peak signal-to-noise ratio (PSNR), based on the mean square error (MSE) between the reference image and the degraded image.

  • Despite widespread adoption, image fidelity measures like PSNR are known not to correlate well with visual quality [154, 155].

  • [156] introduced the structural similarity index measure (SSIM), which further approximates HVS quality assessment by exploiting the HVS's sensitivity to changes in structural information. Several variants of SSIM, such as multi-scale SSIM [157], three-component SSIM [158] and four-component SSIM [159], were further developed for generalization.

  • In addition, information theory can also be introduced to derive image quality assessment, such as the information fidelity criterion (IFC) proposed by [160].

  • This was followed by extended work on the Visual Information Fidelity Measure (VIF) [161].

  • In addition, measures such as the Feature Similarity Index Measure (FSIM) [162], DCTune [163], wavelet-based distortion measurement [164], and the Haar wavelet-based perceptual similarity index (HaarPSI) [165] utilize image features from other domains to approximate the HVS response.

  • Many studies have provided valuable reviews on FR IQA [166–168, 160,169–171].

  • RR IQA measures are suitable when there is partial information from a reference image or degradation process, and can be considered as an intermediate case inspired by FR and NR IQA measures [172-175].

Representative FR and RR methods with their equations are given in Table 6.
[Table 6: representative FR and RR IQA measures]
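As a concrete reference for the two most widely used FR measures, the sketch below computes PSNR from its standard MSE definition and delegates SSIM to scikit-image rather than re-implementing its constants; the test images are synthetic stand-ins.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def psnr(ref, test, data_range=1.0):
    """PSNR = 10 * log10(MAX^2 / MSE) between reference and test images."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(data_range**2 / mse)

ref = np.random.rand(64, 64)
test = np.clip(ref + np.random.normal(0, 0.05, ref.shape), 0, 1)
print(psnr(ref, test), ssim(ref, test, data_range=1.0))
```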
NR IQA measures are useful when the original reference image is not available for quality assessment. A common feature adopted by most NR IQA measures is natural scene statistics (NSS) [177,178], which has invariant properties across various degradations and image contents. These measures include:

  • Blind/No Reference Image Spatial Quality Evaluator (BRISQUE) [179]

  • Distortion Identification-based Image Verity and INtegrity Evaluation (DIIVINE) [180]

  • Natural Image Quality Evaluator (NIQE) [181].

Figure 3 provides the pipeline of representative NR methods: BRISQUE, BLIINDS-II [182], DIIVINE, and NIQE. NR IQA also adopts other features, such as NSS in the DCT domain [183,182], NSS in the multivariate Gaussian model [184], gradient magnitude [185,186], etc. The perceptual index (PI) proposed in [187] combines two NR methods ([181, 188]) for perceptual evaluation of the generated images.
[Figure 3: pipelines of BRISQUE, BLIINDS-II, DIIVINE, and NIQE]
During the development of IQA measures, many studies reported conflicts between distortion-based measures and perceptual quality measures. The trade-off between perception and distortion is systematically illustrated in [189], and related studies have analyzed it further [190,191]; the discussion concluded that image quality can be improved in terms of fidelity or perceptual quality, but only at the expense of the other.

Recently, deep learning has been developed as an alternative paradigm for IQA, learning to map distorted images to numerical scores obtained from human subjective quality assessments (i.e., Mean Opinion Score (MOS) or Differential Mean Opinion Score (DMOS)) on a training set [192-200]. End-to-end training enables deep neural networks to achieve better prediction accuracy than previous hand-crafted methods. However, given the difficulty of collecting ground-truth MOS/DMOS values, performance optimization and generalization are limited by the size of the training set. Managing model complexity and fine-tuning network hyperparameters are also critical for generalizing deep learning-based methods.

3. Network Architectures and Learning Strategies

3.1. Baseline Models

The Multilayer Perceptron (MLP) [201] is one of the earliest artificial neural networks used for image restoration [202-206]. Restoration requires input and output images of the same dimensions, and the MLP framework (Fig. 4a) follows the structure of a fully connected network, learning a high-dimensional mapping between the degraded input image and the clean latent image. However, MLP is inefficient: the redundancy of its large number of parameters burdens computing resources and storage. Furthermore, MLP ignores cross-channel spatial information and the multidimensional structure of images, which is another obstacle to further development.

Considering the structural characteristics of images and the shortcomings of MLP, convolutional neural networks (Fig. 4b) (e.g., [207]) were adopted and provide a more suitable solution for image restoration. Convolutional neural networks (CNNs) have the advantages of shared weights, architectural sparsity, training stability, and hierarchical feature extraction, achieving extraordinary performance and becoming the new state of the art. Although increasing network depth benefits CNN performance through large receptive fields and meaningful hierarchical features, training stability and computational resources become thorny issues, so many advanced techniques are used to deal with these problems. Residual learning and skip connections were invented to stabilize training [39]. Residual blocks (Fig. 4c) effectively improve performance and have become the building blocks of many deep residual networks (Fig. 4d). Other network paradigms include encoder-decoders, autoencoders, and variational autoencoders under unsupervised learning schemes, aimed at learning high-level sparse representations from training data (Fig. 4e). The multi-scale network (Fig. 4f) specializes in handling degradation at various scales. Generative adversarial networks (GANs) (Fig. 4g), as introduced in [208], combine the advantages of generative modeling and adversarial learning to produce plausible textures in the generated images. Since the paired training images required by GAN-based models are difficult to obtain, unpaired training has been proposed, such as CycleGAN [209], where a cycle-consistency loss is designed as a regularization technique to generate high-quality images (Fig. 4h). To prevent the mode-collapse problem, disentangled representations are adopted in [210] (Fig. 4i), providing an alternative that generates diverse output images without aligned training data.
[Figure 4: baseline network architectures (a-i)]

3.2. Learning Strategies

3.2.1. Supervised, semi-supervised and unsupervised learning

It is common and straightforward to employ supervised learning to train neural networks as long as labeled data is available. Minimizing the cost function and backpropagating through the network layers enables powerful learning capabilities under effective supervision, encouraging the network to converge towards the target distribution and produce the desired output. Typical applications are classification and regression for prediction or inference. However, when training deep neural networks, supervised learning is prone to overfitting and/or poor generalization due to complex underlying mapping functions and limited training data. To mitigate these issues, techniques such as early stopping [212], dropout [213] and weight sharing [214] are used to regularize model complexity, and have become necessary in designing and training deep neural networks today. Furthermore, collecting matching image pairs to train deep networks for image restoration is time-consuming.

Unsupervised learning can discover underlying structures and patterns in data, providing potential insights into the mapping between inputs and outputs. It is therefore useful to first learn representative features, which can then be used for other tasks or generative models under supervised learning [215-217,210,218,209]. The reconstruction loss between the original input and the reconstructed output is important for unsupervised learning to exploit the representational power of deep networks. Through dimensionality reduction and reconstruction, autoencoders adopt an encoder-decoder structure to learn sparse representations of images [112,219]. For domain transfer tasks like image-to-image translation, unsupervised mechanisms are essential [220-222]. In practice, labeled or paired training data is always scarce. To exploit large amounts of unlabeled data alongside small amounts of labeled data, semi-supervised learning [223] combines the inherent advantages of supervised and unsupervised learning. Under supervision, deep networks are able to generate the desired outputs from training data, but their performance is also limited. Unlabeled data is cheap and easy to obtain, and unsupervised and semi-supervised learning use it to improve network accuracy and generalization. It has also been shown that, under prespecified assumptions, unsupervised learning can outperform supervised learning for certain classes of problems [224-230]. The authors of [121] employed semi-supervised learning in a CNN containing a supervised branch and an unsupervised branch for single-image dehazing. The authors of [231] used semi-supervised learning to train a deep CNN for single-image rain removal and achieved performance superior to state-of-the-art methods.

3.2.2. Autoencoder and adversarial networks

Autoencoder: An autoencoder is a neural network used to learn efficient encodings or representations of data in an unsupervised or self-supervised manner. Its purpose is to learn a representation of a set of data in a reduced-dimensional space. Alongside this dimensionality reduction, the reconstruction part of the autoencoder attempts to generate, from the reduced encoding, an output as close as possible to the original input. Many variants of autoencoders exist that force the learned representations to assume useful properties, for example regularized autoencoders (sparse, denoising and contractive), which are effective in learning representations for subsequent classification tasks. Autoencoders and variational autoencoders can also be used as integral parts of generative models. Autoencoders are widely used in image denoising [232,233] and super-resolution [234-236]. Deblurring networks like [237] are also related to autoencoders: specifically, the authors use GANs to generate blurred images as representations of the given clear input images, and use the reconstruction part of the autoencoder as the deblurring network.
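The sketch below shows a minimal convolutional denoising autoencoder in Keras matching the encoder-decoder description above; the layer sizes and the 64×64 input are illustrative assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

inp = layers.Input(shape=(64, 64, 1))
# Encoder: dimensionality reduction into a compact representation
x = layers.Conv2D(32, 3, strides=2, padding="same", activation="relu")(inp)
x = layers.Conv2D(64, 3, strides=2, padding="same", activation="relu")(x)
# Decoder: reconstruct an output as close as possible to the clean image
x = layers.Conv2DTranspose(64, 3, strides=2, padding="same", activation="relu")(x)
x = layers.Conv2DTranspose(32, 3, strides=2, padding="same", activation="relu")(x)
out = layers.Conv2D(1, 3, padding="same", activation="sigmoid")(x)

autoencoder = tf.keras.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")  # reconstruction loss
# autoencoder.fit(noisy_images, clean_images, ...)  # denoising training pairs
```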

Adversarial Network: The generative adversarial network introduced by Goodfellow et al. [208] defines a game between two competing networks: the discriminator and the generator. The generator receives information from the input and generates samples. The discriminator learns from real and generated samples and tries to differentiate between them. The goal of the generator is to fool the discriminator by producing perceptually convincing samples that are indistinguishable from real samples. The game between generator G and discriminator D has the following minimax objective:
min_G max_D E_{x∼P_r}[log D(x)] + E_{x̃∼P_g}[log(1 − D(x̃))]
where P_r is the real data distribution and P_g is the generator's model distribution. GANs are known for their ability to generate samples of good perceptual quality in vision tasks. However, training vanilla GANs often suffers from many problems, such as mode collapse and vanishing gradients, as described in [67]. Minimizing the value function in a GAN is equivalent to minimizing the Jensen-Shannon (JS) divergence between the data and model distributions. [238] discussed the difficulty of GAN training caused by the JS-divergence approximation and proposed using the Earth-Mover (also known as Wasserstein-1) distance W(q, p) instead. The value function of Wasserstein GAN is constructed using the Kantorovich-Rubinstein duality [239]:
min_G max_{D∈𝒟} E_{x∼P_r}[D(x)] − E_{x̃∼P_g}[D(x̃)]
where 𝒟 is the set of 1-Lipschitz functions and P_g is the model distribution.

The idea is that the critic's value approximates K × W(P_r, P_θ), where K is the Lipschitz constant and W(P_r, P_θ) is the Wasserstein distance. In this setting, the discriminator network is called a critic, as it approximates the distance between sample distributions. To enforce the Lipschitz constraint in WGAN, [238] clipped the weights to [-c, c], while [240] suggested adding a gradient penalty term to the value function as an alternative:

λ E_{x̂∼P_x̂}[(‖∇_x̂ D(x̂)‖₂ − 1)²]

This approach is robust to the choice of generator architecture and requires little hyperparameter tuning, which is crucial for image deblurring as it allows a lightweight architecture instead of the standard deep ResNet architecture [39] previously used for image deblurring [50]. GAN-based methods are also popular in denoising [241-245] and super-resolution [36,246-250].
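A hedged TensorFlow sketch of this gradient penalty follows; `critic` is assumed to be a Keras model taking NHWC image batches, and the penalty weight of 10 is the value commonly used with WGAN-GP.

```python
import tensorflow as tf

def gradient_penalty(critic, real, fake, lam=10.0):
    """WGAN-GP style penalty: push the critic's gradient norm towards 1
    on random interpolations between real and generated samples."""
    eps = tf.random.uniform([tf.shape(real)[0], 1, 1, 1], 0.0, 1.0)
    interp = eps * real + (1.0 - eps) * fake          # x_hat
    with tf.GradientTape() as tape:
        tape.watch(interp)
        score = critic(interp, training=True)
    grads = tape.gradient(score, interp)
    norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    return lam * tf.reduce_mean((norm - 1.0) ** 2)

# Critic loss sketch: E[D(fake)] - E[D(real)] + gradient penalty
# d_loss = tf.reduce_mean(d_fake) - tf.reduce_mean(d_real) \
#          + gradient_penalty(critic, real, fake)
```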

3.3. State-of-the-Art models

Learning-based single image restoration is still an active topic. Beyond motion deblurring, defocus deblurring has attracted increasing attention. For example, [251] exploited the data available from dual-pixel (DP) sensors on most modern cameras, while [252] proposed an explicit deconvolution process in feature space by combining the classic Wiener deconvolution framework with learned deep features.

An important direction for the further development of image restoration is to combine image processing with deep learning/machine learning methods. For image super-resolution, researchers have begun to focus on different scenarios. For example, a super-resolution network can be trained using the internal data of a single image, which is called zero-shot super-resolution (ZSSR) [253]; MZSR [254] speeds up training by adding a meta-training stage. Graph neural networks are also beginning to be applied to super-resolution [255]. For image denoising, [256] introduced a self-supervised denoising framework called Noise2Same, proposing a new self-supervised loss derived from a self-supervised upper bound of the typical supervised loss. Noise2Same requires neither J-invariance (which may lead to worse denoising models) nor additional information about the noise model, and can therefore be used in a wider range of applications.

4. Proposed Networks

4.1. Super-resolution

Since multiple loss components are involved in the training objective function, various losses need to be minimized simultaneously. A linear combination is the most direct method, but the weighted sum of losses may not be convex, making it difficult to reach the optimal solution through gradient descent. We assume that the multidimensional loss space naturally formed by the multiple loss components is Euclidean, with each individual loss an independent dimension of the space. We propose that the training objective can be defined as the Euclidean distance between the loss point and the origin, or as the hypervolume bounded by the losses and the loss bounds (Fig. 5). Complex multi-objective optimization problems are thereby transformed into single-objective optimization. Table 7 gives the mathematical formulations.
[Figure 5: the Euclidean-distance and hypervolume views of the multi-loss space]
[Table 7: mathematical formulations of the Ed and Hypervol training objectives]
A common feature of the Ed and Hypervol formulations proposed in [257] is that gradient weightings are learned during training, automatically assigning importance to each individual loss. Compared with the manual fine-tuning of weighting parameters adopted by most existing methods, the proposed Ed formulation of the training objective provides an alternative way to optimize model performance for a given model structure.

The two methods have different gradient weighting factors. The Euclidean distance-based scheme uses the projection of each individual loss onto the Euclidean distance between the origin and the loss point. The gradient weighting factor of the Hypervol formulation is the reciprocal of the distance between each loss and its corresponding loss bound. As can be seen from Table 7, the Ed formulation is more concise and needs no additional predefined hyperparameters, while the Hypervol formulation requires the loss bounds l_k to be determined before implementation.
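Following the gradient weightings described above, a hedged sketch of the two formulations is given below; it is an interpretation of the text and of [257] (the exact definitions are in Table 7), and the small epsilon terms are added only for numerical stability.

```python
import tensorflow as tf

def ed_objective(losses):
    """Ed formulation: Euclidean distance between the point formed by the
    individual losses and the origin of the loss space. Each loss gradient
    is implicitly weighted by its projection l_k / ||L||."""
    return tf.sqrt(tf.add_n([l ** 2 for l in losses]) + 1e-12)

def hypervol_objective(losses, bounds):
    """Hypervol formulation (after [257]): maximize the hypervolume bounded
    by each loss l_k and its predefined bound, written here in a
    negative-log form whose gradient weighting is the reciprocal of the
    distance to the bound."""
    return -tf.add_n([tf.math.log(b - l + 1e-12)
                      for l, b in zip(losses, bounds)])

# e.g. total = ed_objective([l_adv, l_x, l_mse, l_ssim])
```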

We applied the proposed method to the baseline model SRGAN [36], adopting the same implementation details as the SRGAN paper. On top of the adversarial loss L_adv and perceptual loss L_X given in the original paper [36], we also added an MSE loss L_MSE and an SSIM loss L_SSIM as additional constraints to form a multi-dimensional loss space. The loss functions are as follows:
L_MSE = (1 / (r²·W·H)) Σ_{x=1..rW} Σ_{y=1..rH} (I^HR_{x,y} − G(I^LR)_{x,y})²

L_X = (1 / (W_{i,j}·H_{i,j})) Σ_{x=1..W_{i,j}} Σ_{y=1..H_{i,j}} (ϕ_{i,j}(I^HR)_{x,y} − ϕ_{i,j}(G(I^LR))_{x,y})²

where I^LR denotes the input low-resolution image, I^HR the high-resolution image, r the upscaling factor, G the generator, and ϕ_{i,j} the feature map obtained by the j-th convolution (after activation) before the i-th max-pooling layer of the VGG19 network; W_{i,j} and H_{i,j} are the dimensions of that feature map.

For the original formulation of the training objective function, based on the equation given in [36], L = L_X + 10⁻³·L_adv, we add the MSE loss L_MSE or the SSIM loss L_SSIM with a weight of 10⁻² to form multi-objective training functions:

L = L_X + 10⁻³·L_adv + 10⁻²·L_MSE    or    L = L_X + 10⁻³·L_adv + 10⁻²·L_SSIM
We tested our method on four SR datasets: Set5 [258], BSDS100 [259], the DIV2K validation set [260], and RealSR [261]. Quantitative results are given in Table 8. For visual comparison, example test images and patches are shown in Figure 6. We used four quality assessment measures: the distortion-based PSNR and SSIM, and the perception-based VIF and PI. For PSNR, SSIM and VIF, higher values indicate better quality, while for the PI (Perceptual Index) lower is better. The definitions and calculations of PSNR, SSIM and VIF are given in Table 6. PI is calculated from Ma's score [188] and NIQE [181] as follows:
PI = ½ · ((10 − Ma) + NIQE)
[Table 8: quantitative results on the four SR test datasets]
From Table 8, we can see that the proposed Ed formulation of the training objective improves model performance. Compared with the original formulation defined with fixed loss weights, both the proposed method and the Hypervol formulation make good use of adaptive weights and achieve better performance. Furthermore, the results show that it is beneficial and necessary to use additional losses as regularizing constraints; among them, the SSIM loss is the most significant for training GAN models to generate high-quality images. As reflected in the quality assessment of images generated by models trained with different objective functions, models trained with f(L_adv, L_X, L_SSIM) usually produce higher scores than those trained with f(L_adv, L_X, L_MSE). From experiments with f(L_adv, L_X, L_MSE, L_SSIM), we find that compared with a linear combination of multiple loss components, the proposed Ed formulation provides an effective alternative for improving the quality of the generated images.
[Figure 6: visual comparison of example SR test images and extracted patches]
For visual evaluation, Figure 6 contains example results with extracted patches for observing texture details (finer details can be seen when zooming in), as the overall visual differences are too subtle to distinguish. We observe that the Ed formulation introduces blurring artifacts, while the Hypervol formulation is able to recover finer details. Furthermore, while the SSIM loss is useful for quantitatively improving image quality, visual quality does not benefit from it, as strange artifacts are evident in small patches. The MSE loss, by contrast, is more suitable for adoption in the various formulations and generates high-quality images.

4.2. Deblurring

In this paper, we also propose a deblurring network named MixNet. We adopt a densely connected encoder-decoder structure to pursue strong deblurring performance, and we remove all the parameter sharing used in [81]. This does not mean that parameter sharing is useless; rather, combining parameter sharing within-scale, across-scale and in multi-scale structures imposes more constraints on the optimization of the shared parameters, which can degrade performance. Additionally, we removed the multi-scale structure to further simplify the network.
[Figure 7: the MixNet architecture]
Network architecture: As shown in Figure 7, the network consists of convolutional layers, DenseBlocks and Inception-A blocks, where Ib and Is represent the input blurred image and the output sharp image, respectively. The backbone of the network uses DenseBlocks and ResBlocks as elements to enlarge the receptive field. By default, each DenseBlock has four nonlinear processing units. The structure of the ResBlock used is shown in Figure 7 and comprises two convolutional layers. For the framework, we remove the multi-scale structure and the parameter-sharing mechanism, thus simplifying the network. The encoder-decoder structure is based on 12 DenseBlocks with independent parameters. We add four Inception-A blocks in the middle of the network to enlarge the feature maps. Inception-A is one of the components of Inception-v4; it has an input size and feature-map width suitable for image restoration, containing 384 channels. Inception-v4 has five types of blocks: Inception-A, Reduction-A, Inception-B, Reduction-B and Inception-C, but all of these except Inception-A have more than 1000 features and are therefore unsuitable for image reconstruction. By adopting Inception-A, the proposed network has a hybrid feature extraction mechanism combining DenseNet and Inception-A, hence the name MixNet. Unlike the 5 × 5 kernels used in [50,80], we use 3 × 3 kernels to control the model size, since two stacked layers with 3 × 3 kernels cover the same receptive field as a single 5 × 5 layer while using fewer weights (18 vs. 25, a saving of roughly 28%).
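Illustrative Keras sketches of the two building blocks described above follow, with 3 × 3 kernels as in the text; the growth rate and channel widths are assumptions, and `res_block` assumes its input already has `width` channels.

```python
import tensorflow as tf
from tensorflow.keras import layers

def dense_block(x, growth=32, units=4):
    """DenseBlock with four processing units: each unit sees the
    concatenation of the block input and all previous unit outputs."""
    feats = [x]
    for _ in range(units):
        inp = layers.Concatenate()(feats) if len(feats) > 1 else feats[0]
        feats.append(layers.Conv2D(growth, 3, padding="same",
                                   activation="relu")(inp))
    return layers.Concatenate()(feats)

def res_block(x, width=64):
    """ResBlock with two 3x3 convolutional layers and an identity skip."""
    h = layers.Conv2D(width, 3, padding="same", activation="relu")(x)
    h = layers.Conv2D(width, 3, padding="same")(h)
    return layers.Add()([x, h])
```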

The loss function is another important element of an image deblurring network. As reviewed in Section 2, the MSE loss is the most important loss for image deblurring. It is directly related to PSNR, one of the most important measures in performance evaluation, as shown below:
MSE = (1 / (W·H)) Σ_{x=1..W} Σ_{y=1..H} (X_{x,y} − Y_{x,y})²,    PSNR = 10 · log₁₀(MAX² / MSE)

where MAX is the maximum possible pixel value.
Therefore, in this work, we adopt MSE loss as the loss function. In our experience, adding other auxiliary losses (such as SSIM loss or adversarial loss) may not always have a significant impact on deblurring.

Implementation: We implemented the proposed MixNet in TensorFlow [264] on an NVIDIA Tesla P100 GPU. Randomly cropped 256×256 regions from blurred images, together with the same regions of the corresponding sharp images, are used as training input. The batch size is set to 16 during training. All weights are initialized using the Xavier method [265] and biases are initialized to zero. The network is optimized with Adam [266] using the default settings beta1 = 0.9, beta2 = 0.999, epsilon = 10⁻⁸. The learning rate is initially set to 0.0001 and then decays exponentially to 0 with a power of 0.3. In our experiments, 2000 epochs are enough for the network to converge.
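A sketch of this training configuration in TensorFlow follows; the decay schedule's step count is an interpretation of the text (the stated betas, epsilon, batch size and initializer are taken from it).

```python
import tensorflow as tf

init = tf.keras.initializers.GlorotUniform()   # Xavier weight initialization
lr_schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,   # initial learning rate 0.0001
    decay_steps=10_000,           # assumed step interval for the decay
    decay_rate=0.3)               # exponential decay with rate 0.3
optimizer = tf.keras.optimizers.Adam(
    learning_rate=lr_schedule, beta_1=0.9, beta_2=0.999, epsilon=1e-8)

BATCH_SIZE = 16
# model.compile(optimizer=optimizer, loss="mse")   # MSE loss (Section 4.2)
```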

Results and Comparison: We conducted experiments on the proposed MixNet and compared it with state-of-the-art methods for dynamic scene deblurring and non-uniform deblurring on the GoPro dataset. The compared methods include DeepDeblur [50], Scale-Recurrent Network (SRN-Deblur) [80], DSHMN [82], DeblurGAN [33], DeblurGAN-v2 [64], unsupervised deblurring [85], domain-specific deblurring [86], SVRNN [89], Dual Residual [90], the Douglas-Rachford network (Dr-Net) [87], region-adaptive deblurring [88] and Blur2Flow [262]. Results are generated from a model trained on the default GoPro training set and then tested on the GoPro test set. For the unsupervised learning methods (unsupervised deblurring and domain-specific deblurring), we used blurred images from the training dataset and sharp images from the new, higher-resolution GoPro dataset. For kernel-based methods, including Blur2Flow, and optimization-based methods, we tested the published model code. Quantitative results and evaluations are shown in Table 9.
[Table 9: quantitative results on the GoPro test set]
In general, unsupervised learning methods yield low PSNR and SSIM, as expected, since no supervision (ground truth) is involved and the dataset size for unsupervised learning is limited. Furthermore, these networks were mainly developed to explore the training mechanism, while their network structures remain largely underdeveloped. Although Dr-Net has the best SSIM, the proposed MixNet achieves state-of-the-art performance on all other evaluation criteria. Moreover, MixNet achieves a good balance between running time and performance, whereas Dr-Net's running time is twice that of MixNet. A visual comparison on the GoPro evaluation dataset is shown in Figure 8. As shown in the figure, the proposed model generally produces better results than the other methods. The domain-specific network, a representative unsupervised deblurring method, clearly shows some color distortion.
[Figure 8: visual comparison on the GoPro evaluation dataset]
We also evaluate and compare our method on the HIDE dataset; the quantitative results, generated by a model trained on the default HIDE training set, are shown in Table 10. As shown in the table, the proposed MixNet outperforms or matches the state-of-the-art methods on all evaluation criteria.
[Table 10: quantitative results on the HIDE dataset]

4.3. Contributions

Here we summarize the main contributions of this paper.

  • Comprehensive review

    A comprehensive literature review is conducted on image restoration, image deblurring, image denoising, image dehazing, super-resolution and image quality assessment. All corresponding baseline deep models are also reviewed.

  • New methods for super-resolution and deblurring

    We proposed a new formulation of the GAN training objective function for super-resolution as an extension of the Hypervol formulation [257], and a new, balanced image deblurring network, MixNet.

  • Experimental verification

    Extensive experiments were conducted applying the Ed and Hypervol formulations to various training objective functions of SRGAN for super-resolution, yielding improvements. Further experiments compare the proposed MixNet with state-of-the-art deblurring methods, showing that MixNet performs better than mainstream image deblurring networks.

5. Conclusions and Discussion

Image restoration is a challenging image processing task because of its ill-posed nature. Traditional methods rely on handcrafted models of the degradation mechanisms and noise. In practice, degradation mechanisms and noise models are rarely simple or unified; as a result, learning-based methods are more applicable and often significantly outperform traditional methods. Deep learning networks are particularly popular, with extensive research covering everything from deblurring, denoising and dehazing to super-resolution. We provide a comprehensive review of methods for these tasks and summarize typical, useful mechanisms and their benefits for various restoration tasks. We also propose new training objectives and formulations for super-resolution, as well as an efficient deep network for blind single-image deblurring. Experimental results demonstrate their benefits and improved performance over state-of-the-art methods.

Image denoising is of extraordinary value for low-level vision and signal processing and can serve as an ideal testbed for evaluating image priors and optimization methods from a Bayesian perspective. Traditionally, image denoising methods have been inspired by existing signal denoising methods. Learning-based methods can significantly improve performance and recover clear images with little prior knowledge. However, recent studies are mainly based on synthetic noisy images rather than real-world noisy images, whose distribution is unknown and rarely Gaussian as commonly assumed. Real-world image denoising will remain a challenge.

Fine-detail reconstruction is the ultimate goal of image deblurring, drawing on expert knowledge such as natural scene statistics or features learned by deep neural networks. These image priors serve as a regularization technique for the image deblurring problem, added to the cost function minimized during optimization. Since dynamic scene blur inherently varies across the image, it is difficult to estimate kernels for recovery with manual methods. In contrast, end-to-end deep neural networks have strong learning capabilities and can approximate the mapping between degraded input images and clean output images. However, careful calibration of the network architecture and parameters is necessary, and dedicated effort is required to optimize model performance.

Reconstructing a high-resolution image from a degraded low-resolution version involves not only increasing dimensionality but also deconvolution. Based on a given image, the pixels to be interpolated are estimated through classical bicubic interpolation or deep neural network learning; deep networks have been shown to produce superior performance. Upsampling layers in deep neural networks insert sub-pixels and increase image dimensions, allowing the image to be rescaled and its quality further improved with finer detail and texture. In addition, since the training objective function often uses various losses as constraints on the solution space, it also determines the direction of optimization and thus the performance. Losses based on image quality metrics have proven effective in improving model performance and are simple and convenient to adopt.

Although most deep learning methods are based on supervised learning, unsupervised or semi-supervised learning can also benefit image restoration through better data representations, and is therefore increasingly combined with supervised mechanisms. Further developments in this direction could make restoration tasks more efficient and effective. Graphs and self-organizing structures can be integrated into supervised deep learning schemes, making ill-posed inverse problems more tractable, more efficient, and less dependent on large numbers of paired training samples.

Origin: blog.csdn.net/SmartLab307/article/details/132803572