[Image Restoration] Paper Reading Notes ----- "Image inpainting based on deep learning: A review"

Original download address

Original download link 1: https://www.sciencedirect.com/science/article/abs/pii/S0141938221000391

Original download link 2: http://s.dic.cool/S/KSS4D4LC


Overview

This review paper was published in 2021. It summarizes deep learning-based inpainting methods by type of neural network structure, analyzes the important technical improvements, and comprehensively reviews the algorithms in terms of model network structure and restoration method. It then selects some representative image inpainting methods for comparison and analysis. Finally, it summarizes the current problems of image inpainting and looks ahead to future development trends and research directions.

Current research in image inpainting mainly covers tasks such as inpainting rectangular block masks, inpainting irregular masks, object removal, denoising, watermark removal, text removal, scratch removal, and old photo colorization. The paper illustrates these eight tasks in a figure:
[Figure: examples of the eight image inpainting tasks]


Traditional Image Restoration Methods

Traditional methods are either patch-based or diffusion-based and were developed before 2014.

Diffusion-based methods: Pixel information around the damaged hole is gradually diffused into the hole, synthesizing new texture to fill it.
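As a rough illustration of the diffusion idea, the sketch below fills a hole by repeatedly replacing each missing pixel with the average of its four neighbours, so that boundary values spread inward. This is a deliberately simplified isotropic scheme and not a specific published method; the function name `diffuse_fill`, the mask convention (1 = missing), and the iteration count are assumptions made for the example.

```python
def diffuse_fill(image, mask, iterations=200):
    """Fill masked pixels by repeatedly averaging their 4-neighbours.

    image: 2D list of floats; mask: 2D list where 1 marks a missing pixel.
    Known pixels are never modified, so information only flows into the hole.
    """
    h, w = len(image), len(image[0])
    # Missing pixels start at 0.0; known pixels keep their values.
    cur = [[0.0 if mask[y][x] else float(image[y][x]) for x in range(w)]
           for y in range(h)]
    for _ in range(iterations):
        nxt = [row[:] for row in cur]
        for y in range(h):
            for x in range(w):
                if not mask[y][x]:
                    continue
                neighbours = [cur[ny][nx]
                              for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                              if 0 <= ny < h and 0 <= nx < w]
                nxt[y][x] = sum(neighbours) / len(neighbours)
        cur = nxt
    return cur


# Toy example: a constant 5x5 image with a 3x3 hole in the middle.
image = [[10.0] * 5 for _ in range(5)]
mask = [[0] * 5 for _ in range(5)]
for y in range(1, 4):
    for x in range(1, 4):
        mask[y][x] = 1
        image[y][x] = 0.0
filled = diffuse_fill(image, mask)
```

On a constant image the hole converges to the surrounding value. On real images this kind of smoothing only propagates low-frequency content, which is why diffusion-style filling tends to work for small or thin holes.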

Patch-based methods: Find the best-matching similar patches in the visible region of the image, then copy their content into the missing region at the pixel level. Sometimes the image contains nothing similar to the missing area; in that case, images semantically similar to the damaged image are retrieved from an existing image database, and suitable patches are borrowed from them.
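The patch search can be sketched as follows, assuming a grayscale image stored as a 2D list, a mask where 1 marks a missing pixel, and a sum-of-squared-differences (SSD) matching cost computed only over the known pixels of the target patch. The function name `best_patch_fill` and the exhaustive search are illustrative assumptions; practical methods use much faster approximate search.

```python
def best_patch_fill(image, mask, top, left, size):
    """Fill the missing pixels of one target patch from the best-matching known patch.

    The target patch is the size x size window at (top, left). Candidate source
    patches must be fully known. Returns (filled image, (row, col) of the source).
    """
    h, w = len(image), len(image[0])
    best, best_cost = None, float("inf")
    for y in range(h - size + 1):
        for x in range(w - size + 1):
            # Skip candidates that contain any missing pixel.
            if any(mask[y + dy][x + dx] for dy in range(size) for dx in range(size)):
                continue
            # SSD over the pixels that are known in the target patch.
            cost = sum((image[y + dy][x + dx] - image[top + dy][left + dx]) ** 2
                       for dy in range(size) for dx in range(size)
                       if not mask[top + dy][left + dx])
            if cost < best_cost:
                best, best_cost = (y, x), cost
    if best is None:
        raise ValueError("no fully known source patch found")
    result = [row[:] for row in image]
    for dy in range(size):
        for dx in range(size):
            if mask[top + dy][left + dx]:
                result[top + dy][left + dx] = image[best[0] + dy][best[1] + dx]
    return result, best


# Toy example: a horizontally repeating pattern 0,1,2,0,1,2,... with one missing pixel.
image = [[x % 3 for x in range(9)] for _ in range(4)]
mask = [[0] * 9 for _ in range(4)]
mask[1][4] = 1
image[1][4] = -1  # placeholder value for the missing pixel
result, source = best_patch_fill(image, mask, top=0, left=3, size=3)
```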

Generative Adversarial Network-Based Methods

The paper groups the methods based on generative adversarial networks (GANs) into three categories: single-stage inpainting, progressive image inpainting, and inpainting based on prior knowledge.


Single-Stage Inpainting

There are two categories: single-result inpainting and multi-result (pluralistic) inpainting.

Single-Result Inpainting

(1) Context Encoder

Model architecture:


Pathak et al. proposed an image inpainting network named Context Encoder, which applies unsupervised feature learning driven by context-based pixel prediction to inpainting images with large holes. The overall architecture is a simple encoder-decoder. The encoder extracts a feature representation of the input image, and the decoder progressively upsamples the compressed feature maps back to the size of the original image. As the intermediate connection between encoder and decoder, the paper proposes a channel-wise fully connected layer, which propagates information within the activations of each feature map, followed by a stride-1 convolution that propagates information across channels.

The Context Encoder is trained with a reconstruction loss (L2) and an adversarial loss, which handle continuity with the context and multiple modes in the output, respectively. The reconstruction loss captures the overall structure of the inpainted region and its consistency with the surrounding visible regions, while the adversarial loss makes the predicted region look realistic.
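A minimal sketch of how the two loss terms could be combined, assuming the mask marks missing pixels with 1 and the discriminator outputs a probability that the inpainted result is real. The `-log D` form of the generator's adversarial term and the weights `lambda_rec` and `lambda_adv` are illustrative assumptions, not values taken from these notes.

```python
import math


def masked_l2_loss(pred, target, mask):
    """Mean squared error computed over the missing (mask == 1) pixels only."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, mask):
        for p, t, m in zip(p_row, t_row, m_row):
            if m:
                total += (p - t) ** 2
                count += 1
    return total / count if count else 0.0


def generator_adversarial_loss(disc_prob_real):
    """Generator-side adversarial term: small when the discriminator is fooled."""
    return -math.log(disc_prob_real)


def joint_loss(pred, target, mask, disc_prob_real, lambda_rec=0.999, lambda_adv=0.001):
    """Weighted sum of reconstruction and adversarial terms (weights are assumptions)."""
    rec = masked_l2_loss(pred, target, mask)
    adv = generator_adversarial_loss(disc_prob_real)
    return lambda_rec * rec + lambda_adv * adv


# Toy example: two missing pixels, discriminator undecided (probability 0.5).
pred = [[1.0, 2.0], [3.0, 4.0]]
target = [[1.0, 0.0], [3.0, 0.0]]
mask = [[0, 1], [0, 1]]
rec = masked_l2_loss(pred, target, mask)
total = joint_loss(pred, target, mask, disc_prob_real=0.5)
```

Weighting the reconstruction term heavily keeps the overall structure stable, while even a small adversarial weight pushes the output away from the blurry average that a pure L2 objective favours.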

(2) Globally and Locally Consistent

Model architecture:

This method addresses the shortcomings of the Context Encoder: it only handles fixed low-resolution images, the mask region must be located in the center of the image, and the inpainted region cannot maintain local consistency with its surroundings.

The network is trained with two auxiliary context discriminators. The global discriminator takes the entire image as input, while the local discriminator takes only a small region around the completed area, which ensures global and local consistency of the result, respectively. Dilated convolutions are used in the four middle layers of the completion network to enlarge the receptive field of the extracted features.

(3) Partial Convolutions

When standard convolutional networks are used to repair damaged images, the holes are usually filled with a substitute value such as the mean, and the convolution responses are computed over both the valid pixels and these substitute values. For large holes this tends to leave the repaired area without texture and to produce artifacts such as color discrepancy and blurring, which seriously affect visual quality.

Liu et al. proposed partial convolution to solve these problems. At each layer, the convolution result depends only on the undamaged pixels, which are indicated by a binary mask. The mask is updated after every partial convolution, and with enough successive layers the holes in the mask disappear, so the final features are derived only from the pixels of the visible regions.
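The mechanism can be sketched for a single-channel image as below. The sketch assumes a mask where 1 marks a valid pixel (note this is the opposite of a "1 = hole" convention), zero padding treated as invalid, and renormalization of each response by window size divided by the number of valid pixels in the window. The function name `partial_conv2d` is hypothetical and the code is a simplified reading of the idea, not the authors' implementation.

```python
def partial_conv2d(image, mask, kernel, bias=0.0):
    """Single-channel partial convolution with mask update.

    image, mask: 2D lists of equal size; mask 1 = valid pixel, 0 = hole.
    kernel: k x k list with odd k. Returns (output, updated_mask).
    """
    h, w = len(image), len(image[0])
    k = len(kernel)
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    new_mask = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc, valid = 0.0, 0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    ny, nx = y + dy, x + dx
                    # Only valid pixels inside the image contribute.
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx]:
                        acc += kernel[dy + r][dx + r] * image[ny][nx]
                        valid += 1
            if valid:
                # Rescale to compensate for the pixels that were skipped.
                out[y][x] = acc * (k * k / valid) + bias
                # A location becomes valid once it has seen one valid pixel.
                new_mask[y][x] = 1
    return out, new_mask


# Toy example: constant image with a 3x3 hole, averaged by a 3x3 mean kernel.
image = [[2.0] * 5 for _ in range(5)]
mask = [[1] * 5 for _ in range(5)]
for y in range(1, 4):
    for x in range(1, 4):
        mask[y][x] = 0
        image[y][x] = 0.0
kernel = [[1.0 / 9.0] * 3 for _ in range(3)]
out1, mask1 = partial_conv2d(image, mask, kernel)
out2, mask2 = partial_conv2d(out1, mask1, kernel)
```

After the first layer only the centre of the hole is still invalid, and after the second layer the mask is entirely valid, which illustrates how the hole shrinks layer by layer.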

(4) Pyramid-context Encoder

Zeng et al. proposed the Pyramid-context Encoder Network (PEN-Net), which consists of a pyramid-context encoder, a multi-scale decoder, and an adversarial training loss. It fills in missing regions at both the image level and the feature level to improve inpainting quality.

Main innovation points:

  • An attention transfer network is introduced to learn the similarity, in high-level feature maps, between patches of the damaged region and patches of the visible region. Features of the visible region are then transferred to the low-level, high-resolution feature maps according to the patch similarity weights to fill in the missing content, which ensures the visual and semantic consistency of the restoration.

  • A multi-scale decoder with deeply supervised pyramid losses and an adversarial loss is proposed. Through skip connections, the features reconstructed by the attention transfer network are decoded together with the latent features to obtain the inpainted image.

(5) PRVS (Progressive Reconstruction of Visual Structure)

PRVS (Progressive Reconstruction of Visual Structure) introduces visual structure reconstruction (VSR) layers on top of partial convolutions. Two VSR layers are deployed in the encoder and decoder respectively to generate structural information at different scales.

By gradually merging structural information into the features, the GAN-based network outputs an image with a reasonable structure. Transposed convolution is also introduced into the partial convolution layers used for upsampling in the decoder, to overcome the limitations of the existing partial convolution module. During restoration, partial convolutions and bottleneck blocks first restore some edges in the missing regions. The reconstructed edges are then combined with the input image with holes, and the holes are gradually shrunk by filling in semantically meaningful content until a fine inpainting result is obtained.

(6) Recurrent Feature Reasoning

This method exploits the correlation between adjacent pixels and strengthens the constraints for estimating pixels deep inside the hole. It iteratively infers the hole boundaries of the convolutional feature maps and then uses them as clues for further inference. This module not only significantly improves network performance, but also bypasses a limitation of progressive methods, namely that the input and output of the network must be represented in the same space.

A Knowledge Consistent Attention (KCA) module is also proposed. It adaptively combines the attention scores from different recurrences and keeps the patch-swapping process consistent across recurrences, leading to better results with fine details.

(7) Mutual Encoder-Decoder

A mutual encoder-decoder CNN is proposed for joint structure and texture restoration. The structure and texture of the input image are represented by the deep and shallow CNN features of the encoder, respectively. The deep features are passed to the structure branch, which carries structural semantics, while the shallow features are passed to the texture branch, which carries texture details.

Each branch fills holes using CNN features at multiple scales. The CNN features of the two branches are concatenated and then equalized: channel attention first reweights the channels, and a bilateral propagation activation function then achieves spatial equalization. The equalized features at the different CNN feature levels are passed to the decoder through skip connections to generate the inpainted image.

Multi-Result (Pluralistic) Inpainting

(1) Pluralistic Image Completion

During training, the reconstructive path uses the ground-truth content of the masked region to obtain a prior distribution over the missing region and reconstructs the entire original image. The generative path uses this prior distribution to regularize the distribution of the encoder's latent vector, which amounts to adding an extra constraint on the latent vector. This coupled design is what enables the generative path to produce diverse complete images.

During testing, the reconstructive path is discarded, and the generative path uses the conditional prior distribution to inpaint the masked input image, producing diverse high-quality results.

(2) FLYING

The network consists of two branches. The main branch consists of a diversity mapping module and a generation module, and is responsible for mapping the instance image space to the conditional completion image space. The secondary branch acts as the conditional label in the network and mainly consists of a conditional encoder module. Feeding different instance images into this model yields different inpainted images. The results are then evaluated by the discriminator and ranked by a comprehensive score over multiple losses, and the best several images are returned.


Progressive Image Restoration

These methods are divided into two categories: low-resolution image inpainting and high-resolution image inpainting.

Low Resolution Image Restoration

(1) Contextual Attention

Yu et al. proposed a spatially discounted reconstruction loss to improve the visual quality of large-hole inpainting. They designed a two-stage coarse-to-fine network architecture, a feed-forward fully convolutional neural network without batch normalization layers.
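The idea behind a spatially discounted loss is that pixels near the hole boundary are less ambiguous than pixels deep inside it, so they should count more. The sketch below weights each missing pixel by `gamma ** l`, where `l` is its distance to the nearest known pixel. The Manhattan distance, the value `gamma=0.99`, the normalization by the weight sum, and the function names are assumptions made for illustration.

```python
def discount_weights(mask, gamma=0.99):
    """Weight gamma**l for each missing pixel (mask == 1), 0.0 for known pixels.

    l is the Manhattan distance to the nearest known pixel (an assumption).
    """
    h, w = len(mask), len(mask[0])
    known = [(y, x) for y in range(h) for x in range(w) if not mask[y][x]]
    if not known:
        raise ValueError("mask must leave at least one known pixel")
    weights = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                l = min(abs(y - ky) + abs(x - kx) for ky, kx in known)
                weights[y][x] = gamma ** l
    return weights


def discounted_l1_loss(pred, target, mask, gamma=0.99):
    """Weighted L1 loss over the hole, normalized by the sum of weights."""
    weights = discount_weights(mask, gamma)
    num = sum(wt * abs(p - t)
              for w_row, p_row, t_row in zip(weights, pred, target)
              for wt, p, t in zip(w_row, p_row, t_row))
    den = sum(map(sum, weights))
    return num / den if den else 0.0


# Toy example: 3x3 hole in a 5x5 image.
mask = [[0] * 5 for _ in range(5)]
for y in range(1, 4):
    for x in range(1, 4):
        mask[y][x] = 1
weights = discount_weights(mask)
target = [[1.0] * 5 for _ in range(5)]
pred = [[0.0 if mask[y][x] else 1.0 for x in range(5)] for y in range(5)]
loss = discounted_l1_loss(pred, target, mask)
```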

The network is divided into two stages:
In the first stage, the image with its holes filled with white pixels and a binary mask of the same size are fed as an input pair to the coarse network. Dilated convolutions are used in the coarse network to effectively enlarge the receptive field, and a reconstruction loss is used to stabilize training.

The fine network in the second stage takes the coarse prediction of the first stage as input. A modified WGAN-GP loss is applied to the global and local outputs of the fine network to enforce global and local consistency, and together with the spatially discounted reconstruction loss it guides training so that the model learns finer image details than the coarse network. The fine network has two parallel encoders.

The upper encoder introduces a contextual attention layer and uses visible region blocks as convolutional filters to process the generated blocks, focusing on extracting background regions of interest. The lower encoder imagines the contents of missing regions by dilated convolutions. After the outputs of the two encoders are aggregated, they are fed into the decoder to reconstruct the recovered image.
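A minimal sketch of the matching step in such an attention layer, assuming patches are flattened into vectors: each hole (foreground) patch is compared with every visible (background) patch by cosine similarity, the scores are turned into weights with a scaled softmax, and the hole patch is reconstructed as the weighted sum of background patches. The `scale` value and the function names are assumptions made for the example, and a real layer performs this with convolutions over feature maps.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def softmax(scores, scale=10.0):
    """Scaled softmax; subtracting the maximum keeps the exponentials stable."""
    top = max(scores)
    exps = [math.exp(scale * (s - top)) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(foreground_patch, background_patches, scale=10.0):
    """Return (attention weights, reconstructed foreground patch)."""
    scores = [cosine_similarity(foreground_patch, b) for b in background_patches]
    weights = softmax(scores, scale)
    dim = len(foreground_patch)
    recon = [sum(wt * b[i] for wt, b in zip(weights, background_patches))
             for i in range(dim)]
    return weights, recon


# Toy example: the first background patch points in the same direction as the hole patch.
fg = [1.0, 0.0, 0.0, 0.0]
bgs = [[2.0, 0.0, 0.0, 0.0], [0.0, 3.0, 0.0, 0.0]]
weights, recon = attend(fg, bgs)
```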

(2) Gated Convolution

Gated convolution replaces the vanilla convolutions in the network. It addresses the problem that vanilla convolution treats all input pixels as valid, by providing a learnable dynamic feature selection mechanism for each channel at each spatial location. Combined with SN-PatchGAN, which accelerates model training, and with support for user guidance, the method produces higher-quality and more flexible inpainting results than contextual attention.
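The gating idea can be sketched for a single channel: one convolution produces features, a second convolution over the same input produces a gate, and the output is the activated feature multiplied by the sigmoid of the gate. In a real network both kernels are learned; here they are fixed by hand, and the leaky ReLU activation and function names are assumptions for illustration.

```python
import math


def conv2d(x, kernel):
    """Same-size single-channel 2D correlation with zero padding."""
    h, w = len(x), len(x[0])
    r = len(kernel) // 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            acc = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ni, nj = i + di, j + dj
                    if 0 <= ni < h and 0 <= nj < w:
                        acc += kernel[di + r][dj + r] * x[ni][nj]
            out[i][j] = acc
    return out


def sigmoid(v):
    """Numerically stable logistic function."""
    if v >= 0:
        return 1.0 / (1.0 + math.exp(-v))
    e = math.exp(v)
    return e / (1.0 + e)


def leaky_relu(v, slope=0.2):
    return v if v >= 0 else slope * v


def gated_conv2d(x, feature_kernel, gate_kernel):
    """Output = activation(feature) * sigmoid(gate), element-wise."""
    feat = conv2d(x, feature_kernel)
    gate = conv2d(x, gate_kernel)
    return [[leaky_relu(f) * sigmoid(g) for f, g in zip(f_row, g_row)]
            for f_row, g_row in zip(feat, gate)]


# Toy example: identity feature kernel with a neutral gate and a closing gate.
x = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
identity = [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
neutral_gate = [[0.0] * 3 for _ in range(3)]
closing_gate = [[0.0, 0.0, 0.0], [0.0, -10.0, 0.0], [0.0, 0.0, 0.0]]
half_open = gated_conv2d(x, identity, neutral_gate)
closed = gated_conv2d(x, identity, closing_gate)
```

Because the gate is a soft value between 0 and 1 computed per location, the network can learn to suppress hole pixels gradually instead of relying on a hard, rule-based mask update.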

(3) Coherent Semantic Attention

Liu et al. used the U-Net architecture in both the coarse and fine stages and proposed a Coherent Semantic Attention (CSA) layer, placed in the fourth layer of the fine network's encoder, which models the semantic correlation and feature continuity of hole regions.

Because perceptual loss has limited ability to optimize the convolutional layers in image inpainting and may mislead the training of the CSA layer, a consistency loss is introduced to enforce consistency between the feature maps of corresponding layers in the encoder and decoder. A feature patch discriminator, combined with a 70×70 patch discriminator, is also introduced to accelerate and stabilize adversarial training, so that the refinement network synthesizes more meaningful high-frequency details.

(4) PEPSI


PEPSI uses a shared encoding network and a parallel decoding network with a coarse path and an inpainting path. This reduces the number of convolution operations and largely solves the problem that coarse-to-fine two-stage inpainting networks consume a lot of computing resources. PEPSI also improves the traditional contextual attention module (CAM): Euclidean distance replaces cosine similarity when computing similarity scores between foreground and background patches. Finally, a region ensemble discriminator (RED) processes multiple feature regions separately, so that the model can handle the irregular holes of arbitrary position, shape, and size that occur in real scenes.
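To see why the choice of similarity measure matters, the sketch below picks the best background patch for a foreground patch under both measures. Cosine similarity ignores vector magnitude, so a patch with the same direction but very different intensity scores highest, whereas Euclidean distance also penalizes the magnitude difference. The function names are hypothetical, and the way PEPSI normalizes distances into attention scores is not reproduced here.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)


def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def best_match(foreground, backgrounds, use_euclidean):
    """Index of the background patch with the highest similarity score."""
    if use_euclidean:
        # Smaller distance means more similar, so negate it to get a score.
        scores = [-euclidean_distance(foreground, b) for b in backgrounds]
    else:
        scores = [cosine_similarity(foreground, b) for b in backgrounds]
    return max(range(len(backgrounds)), key=scores.__getitem__)


# Patch 0 has the same direction but ten times the magnitude; patch 1 is nearly identical.
fg = [1.0, 1.0]
bgs = [[10.0, 10.0], [1.0, 1.2]]
cosine_choice = best_match(fg, bgs, use_euclidean=False)
euclidean_choice = best_match(fg, bgs, use_euclidean=True)
```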

High Resolution Image Restoration

(1) Contextual Residual Aggregation

Yi et al. proposed a contextual residual aggregation mechanism and apply attention transfer at multiple levels of abstraction to improve inpainting quality. By combining a thin and deep network configuration, lightweight gated convolutions (LWCs), low-level convolutions, and attention score sharing, they designed a lightweight model for irregular hole filling that can run inference and inpainting on high-resolution images without requiring much computing power.

(2) Iterative Confidence Feedback and Guided Upsampling

The restoration process is divided into two stages. In the first stage, a coarse-to-fine cascaded network inpaints the low-resolution image: a coarse result is produced first, and in the fine step a confidence map of the inpainting result is used to iteratively correct unsatisfactory regions. In the second stage, a guided upsampling network generates the high-resolution (HR) inpainted image from the low-resolution (LR) result of the first stage. The guided upsampling network consists of two shallow networks, one that learns patch similarity via the PatchGAN discriminator and one for image reconstruction.


Restoration based on prior knowledge

These methods are divided into two categories: contour edge-guided image inpainting and generative prior-guided image inpainting.

Contour edge-guided image inpainting

(1) FAII

FAII (Foreground-Aware Image Inpainting) is a foreground-aware image inpainting model. As shown in Fig. 8, the model first uses DeepCut to detect the foreground objects in the image, then uses an edge detector to extract the foreground contours, then applies a coarse-to-fine network to complete the contours, and finally feeds the completed contours together with the incomplete image into another coarse-to-fine network, which produces the final inpainted image.

The innovation of this method is that it decouples structure inference from content completion: it first obtains the natural contour of the target object and then uses the completed contour as prior guidance for completing the image. It proposes and validates that explicitly guiding image inpainting with structural priors is a promising research direction.

(2) EdgeConnect

The method has two stages: edge generation and image completion. The mask, the grayscale version of the masked image, and its edge map are the inputs of the edge generator, which predicts the complete edge map. The predicted edge map, used as prior knowledge, is fed together with the masked original image into the image completion network to obtain the inpainted image.

[Figure: inpainting results of the EdgeConnect method]

EdgeConnect uses an edge generator to produce rough contours in the missing regions, which gives the second-stage image completion network prior information about the image structure. The completion network then only needs to fill in details on top of this rough structural prior, yielding a completed image with good structure and texture; this is the innovation of the network. How to generate plausible edges for the missing area in the first stage remains an open problem for this method.

Generative prior-guided image inpainting

(1) PGG

In PGG (Prior Guided GAN), the noise that best matches the damaged image is predicted by a trained offline parametric model and used as a noise prior, which is fed to the generative model to reconstruct the natural image. The network is regularized by adding a prior on the structure of the target image. A recurrent network is then proposed to help with sequential reconstruction, and the model is further extended to high-resolution image inpainting and video restoration.

Here, image inpainting is cast as finding the latent code that best matches the target image. This approaches deep learning image inpainting from a new perspective, unlike methods that directly train deep encoder-decoder networks on damaged images.

(2) DGP

DGP (Deep Generative Prior) uses a generative adversarial network pre-trained on large-scale natural images as a prior that captures rich image semantics. This prior is richer than what can be obtained from a single image and includes color, spatial consistency, texture, and high-level semantics. By using the feature distance obtained from the discriminator as a regularizer and a progressive fine-tuning strategy for the generator, DGP better preserves the image statistics learned by the GAN, and thus provides richer restoration and manipulation effects.

It obtains convincing results on many image processing tasks, such as image colorization, image completion, and super-resolution reconstruction.


Datasets for Image Inpainting

Because it is difficult to collect a large number of paired real damaged images, researchers usually select a suitable image dataset for their experiments and then add masks to the original data. The most widely used masks are rectangular holes and irregular masks.

Irregular mask dataset:

The most commonly used mask dataset at present is the one from the following paper:

Guilin Liu, Fitsum A. Reda, Kevin J. Shih, Ting-Chun Wang, Andrew Tao, Bryan Catanzaro, Image inpainting for irregular holes using partial convolutions, in: Proc. ECCV, 2018, pp. 85–100.

It contains 12,000 masks in six categories of hole-to-image area ratio. Each category contains 1,000 masks with boundary constraints (holes are at least 50 pixels from the image boundary) and 1,000 masks without boundary constraints.

A partial mask sample is shown below:

In the sample, the two masks on the left have boundary constraints and the two on the right do not. Both types have been studied, and images with masks that lack boundary constraints are harder to restore.
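To make the boundary constraint and the hole-to-image area ratio concrete, here is a small sketch that draws an irregular mask with a random walk, keeps every hole pixel at least `border` pixels away from the image edge, and measures the resulting ratio. This is a hypothetical generator written for illustration, not the procedure used to build the dataset above; the function names and parameters are assumptions.

```python
import random


def random_walk_mask(height, width, steps, border=0, seed=0):
    """Irregular mask (1 = hole) drawn by a random walk that stays inside the border."""
    rng = random.Random(seed)
    mask = [[0] * width for _ in range(height)]
    y = rng.randrange(border, height - border)
    x = rng.randrange(border, width - border)
    for _ in range(steps):
        mask[y][x] = 1
        dy, dx = rng.choice(((1, 0), (-1, 0), (0, 1), (0, -1)))
        # Clamp the walker so it never enters the border band.
        y = min(max(y + dy, border), height - border - 1)
        x = min(max(x + dx, border), width - border - 1)
    return mask


def hole_ratio(mask):
    """Fraction of pixels covered by holes."""
    return sum(map(sum, mask)) / (len(mask) * len(mask[0]))


def apply_mask(image, mask, fill=1.0):
    """Replace hole pixels with a fill value (white by default)."""
    return [[fill if m else v for v, m in zip(row, m_row)]
            for row, m_row in zip(image, mask)]


mask = random_walk_mask(64, 64, steps=500, border=10, seed=1)
ratio = hole_ratio(mask)
damaged = apply_mask([[0.0] * 64 for _ in range(64)], mask)
```

Setting `border=0` gives the unconstrained case, in which holes may touch the image edge and leave less surrounding context.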

Image inpainting dataset:

Another image restoration review that I read earlier gives a more comprehensive summary of datasets.

These datasets can be downloaded from https://paperswithcode.com/


Analysis

From the research results of image restoration in recent years:

(1) In terms of network selection, the image inpainting method based on convolutional neural network is still the mainstream method for deep learning image inpainting application research.

(2) The generative networks used are mainly VAEs and GANs, each with its own advantages and disadvantages. Training VAE-based inpainting methods is usually more stable, but the generated results tend to be blurry. GAN-based inpainting methods improve the quality of the generated results, but are difficult to train. Inpainting methods that combine VAE and GAN can therefore balance the shortcomings of the two.

(3) Increasing the width and depth of the network lets it fit the data better, but blindly expanding them leads to an explosion in the number of parameters and to training difficulties. Current inpainting networks therefore adopt a thin and deep structure to control the number of parameters, and add multi-scale feature layers or residual skip connections to help with vanishing gradients.

(4) In image inpainting tasks, traditional convolutions usually treat all input pixels as valid pixels, which easily leads to artifacts such as color differences and blurring. Therefore, introducing partial convolution or gated convolution can alleviate this problem to a certain extent, so convolution-based improvement is also a feasible breakthrough direction in the field of image inpainting.

(5) Coarse-to-fine network models for progressive inpainting easily become too complex. It is therefore worthwhile to study how to reuse specific encoder and decoder layers to limit the number of model parameters and thereby improve training efficiency. In PEPSI, for example, the coarse and fine paths share weights, which lets them improve each other, and they use the same encoder.

(6) When repairing severely damaged objects with unstructured, complex textures, multi-result (pluralistic) inpainting can provide a variety of reasonable results to meet the needs of different situations.

(7) For high-resolution images: directly inpainting a high-resolution image consumes a lot of resources. The current mainstream high-resolution methods therefore first inpaint a low-resolution image obtained by downsampling the original. The repaired area is then upsampled, or reconstructed with super-resolution, back to the original image size, and finally pasted over the corresponding damaged area, which indirectly completes the high-resolution inpainting task.

(8) Compared with purely data-driven deep learning inpainting methods, methods that add a contour inference module or a structural prior can make fuller use of the prior knowledge in the visible regions for more accurate texture inference. Prior knowledge-based inpainting of images with complex textures is therefore a meaningful research direction when applied to specific scenes.
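The indirect high-resolution pipeline described in point (7) can be sketched as follows. The low-resolution inpainter is passed in as a function; the `mean_fill` stand-in used here simply fills holes with the mean of the known pixels and is only a placeholder for a real network. Average pooling, nearest-neighbour upsampling, and all function names are assumptions made for the example.

```python
def downsample(img, f):
    """Average-pool a 2D list by an integer factor f (sizes must be divisible by f)."""
    h, w = len(img), len(img[0])
    return [[sum(img[y * f + dy][x * f + dx] for dy in range(f) for dx in range(f)) / (f * f)
             for x in range(w // f)]
            for y in range(h // f)]


def upsample(img, f):
    """Nearest-neighbour upsampling by an integer factor f."""
    return [[img[y // f][x // f] for x in range(len(img[0]) * f)]
            for y in range(len(img) * f)]


def mean_fill(lr_image, lr_mask):
    """Placeholder low-resolution inpainter: fill holes with the mean of known pixels."""
    known = [v for row, m_row in zip(lr_image, lr_mask) for v, m in zip(row, m_row) if not m]
    mean = sum(known) / len(known)
    return [[mean if m else v for v, m in zip(row, m_row)]
            for row, m_row in zip(lr_image, lr_mask)]


def inpaint_high_res(image, mask, factor, inpaint_low_res):
    """Downsample, inpaint at low resolution, upsample, paste into the hole only."""
    lr_image = downsample(image, factor)
    # A low-resolution pixel counts as missing if any pixel in its block is missing.
    lr_mask = [[1 if v > 0 else 0 for v in row] for row in downsample(mask, factor)]
    lr_result = inpaint_low_res(lr_image, lr_mask)
    up = upsample(lr_result, factor)
    return [[up[y][x] if mask[y][x] else image[y][x] for x in range(len(image[0]))]
            for y in range(len(image))]


# Toy example: constant 8x8 image with a 4x4 hole (mask 1 = missing).
image = [[5.0] * 8 for _ in range(8)]
mask = [[0] * 8 for _ in range(8)]
for y in range(2, 6):
    for x in range(2, 6):
        mask[y][x] = 1
        image[y][x] = 0.0
result = inpaint_high_res(image, mask, factor=2, inpaint_low_res=mean_fill)
```

Pasting only inside the mask keeps the known high-resolution pixels untouched, so any loss of detail from the resampling is confined to the repaired area.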

For some quantitative comparisons on related datasets:

Quantitative comparison of the state-of-the-art methods on the CelebA HQ face dataset, based on the inpainting of the central rectangular area:

[Figure: table of quantitative comparison results]


Conclusion

Although image inpainting has made significant progress in recent years, some problems remain unresolved, mainly the following:

(1) Current image inpainting methods can achieve better inpainting results when dealing with regular structure data, small hole inpainting, and low-resolution image inpainting. How to improve the repair effect of complex textures, large holes and high-resolution images?

(2) Network selection and design: GANs, autoencoders, and combinations of the two are currently the main basic frameworks for image inpainting. How to apply other deep learning models to image inpainting is worth exploring.

Generally speaking, the deeper the network structure, the better the reconstruction and repair effect, but it will also cause problems such as training and convergence difficulties. How to balance the contradiction between the complexity of the network and the quality of the restored image is a problem that needs further research.

(3) How to combine domain knowledge and prior knowledge with deep learning frameworks more effectively in specific applications, in order to improve the performance of existing deep learning-based image inpainting, is a direction worth exploring. If domain and prior knowledge can be fully used to guide the deep learning model, the model can extract high-level semantic features rich in context information and learn more complex mappings between damaged and repaired images, while the accuracy and reasonableness of these mappings are also ensured. This would further improve both the reconstruction performance and the interpretability of deep learning-based inpainting models.

(4) In video stream image restoration, the current image restoration methods based on deep learning benefit from the good spatial feature extraction capabilities of convolutional neural networks, and most of them use deep convolutional neural networks to build network layers. Recurrent neural networks can mine semantic information at the time series feature level of data, and have good applications in the fields of speech and natural language processing. How to effectively combine two kinds of neural networks (Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN)) to process video stream image inpainting will be a very meaningful research direction.

Origin blog.csdn.net/hshudoudou/article/details/126711877