High-Resolution Image Inpainting using Multi-Scale Neural Patch Synthesis

Summary

Paper Source: CVPR 2017

Disadvantage of previous methods: prior approaches based on semantic context information fill large holes well and capture higher-level image features, but due to memory limitations the networks are difficult to train and can only process images of small resolution.

Method proposed by the paper: a multi-scale neural patch synthesis method based on joint optimization of image content and texture constraints. It not only preserves contextual structure, but also produces high-frequency details by matching and adapting patches with the most similar mid-layer feature correlations of a deep classification network.

Advantage: can handle high-resolution images.

Network architecture:

Datasets: ImageNet (used to pre-train the VGG network) and the Paris StreetView dataset.

Code: Faster-High-Res-Neural-Inpainting

Introduction

  • Image completion: the task of filling in missing regions of an image with plausible content.
  • Existing hole-filling solutions fall into two categories. The first relies on texture synthesis: the hole is filled with textures from the surrounding region, commonly by synthesizing patches of similar texture from coarse to fine. The second is data-driven: the missing portion is filled using information from a large external database.
The paper belongs to the first category.
First-category references: [14], [13], [27], [26], [6], [12], [40], [41], [23], [24], [2].
Among them, references [12] and [41] introduce multiple scales and orientations to find better-matching patches.
Reference [2] proposed PatchMatch, a fast approximate nearest-neighbor patch search algorithm. Benefit: these methods propagate high-frequency texture details well; drawback: they cannot capture the global structure or semantics of the image.
The second category assumes that regions surrounded by similar backgrounds tend to have similar content. Pros: a high success rate when enough data is available. Cons: it requires access to a database, which limits the application scenarios.
  • Recently, deep neural networks have been introduced into texture synthesis and image style transfer.
  • Source of inspiration for this paper:
    • Pathak et al. [32] proposed the Context Encoder, an encoder-decoder CNN that directly predicts the missing image region using a combination of ℓ2 and adversarial losses. Cons: texture details are handled poorly, and the adversarial loss is hard to train when the input is a larger (high-resolution) image.
    • Li and Wand [28] perform image style transfer by optimizing over the image so that its intermediate-layer neural responses are similar to those of the content image, while its local low-layer responses are similar to those of the style image. A local response (typically 3 × 3) is called a neural patch. This method shows that high-frequency detail is transferred from the style image to the content image; nowadays, however, style transfer more commonly uses the Gram matrices of neural responses from [15].
  • The proposed method:
    • Combine the structure-prediction ability of the encoder-decoder CNN (Context Encoder) with the ability of neural patch synthesis to produce realistic high-frequency detail, to accomplish the image inpainting task.
    • Analogous to style transfer, we use the trained encoder-decoder CNN (Context Encoder) as the global content constraint, and the similarity of local neural patches between the missing region and the known region as the texture constraint. (Using a pre-trained classification network, the texture constraint is modeled on the intermediate-layer responses to patches of the missing region and its surroundings.) The two constraints are optimized by back-propagation with the limited-memory BFGS algorithm.
    • To further handle high resolution, a multi-scale neural patch synthesis method is proposed. Suppose the image size is 512 × 512 with a 256 × 256 hole in the middle; we build a three-level pyramid with a downscaling step of 2, halving the image at each step (512 × 512 with a 256 × 256 hole, 256 × 256 with a 128 × 128 hole, 128 × 128 with a 64 × 64 hole). We then perform coarse-to-fine inpainting: at the coarsest level, (1) the joint optimization is initialized with the output of the content network and updates the missing region at that scale, and (2) the upsampled optimization result initializes the content constraint at the next scale. This is repeated until the joint optimization at the highest resolution is complete (see the pyramid sketch after this list).
  • The contribution of this paper:
    • A joint optimization framework in which convolutional neural networks model both the global content constraint and the local texture constraint.
    • A multi-scale neural patch synthesis algorithm based on the joint optimization framework for high-resolution image inpainting.
    • Studies showing that features extracted from the intermediate layers of a neural network can be used to synthesize realistic image content and texture, and can additionally be used for style transfer.
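
As a concrete illustration of the pyramid above, here is a minimal sketch (assuming PyTorch; `build_pyramid` is a hypothetical helper, not part of the paper's code; sizes follow the 512 × 512 example):

```python
import torch
import torch.nn.functional as F

def build_pyramid(image, num_levels=3):
    # Halve the image at each step; return levels ordered coarsest first,
    # e.g. a 512x512 input yields [128x128, 256x256, 512x512].
    levels = [image]
    for _ in range(num_levels - 1):
        levels.append(F.interpolate(levels[-1], scale_factor=0.5,
                                    mode='bilinear', align_corners=False))
    return levels[::-1]

x = torch.rand(1, 3, 512, 512)  # stand-in for a real image tensor
for level in build_pyramid(x):
    print(tuple(level.shape))   # (1,3,128,128) (1,3,256,256) (1,3,512,512)
```

The hole mask is downscaled the same way, so a 256 × 256 hole becomes 128 × 128 and 64 × 64 at the coarser levels.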

Related work (two sources of inspiration)

Structure Prediction using Deep Networks

  • Unlike conventional image generation (GANs), the goal here is to predict the missing content conditioned on the fixed, known image region. The recently proposed encoder-decoder network for image inpainting (Context Encoder) uses a loss function combining the ℓ2 loss and an adversarial loss. In this paper, we use the Context Encoder as the global content prediction network and use its output to initialize the multi-scale neural patch synthesis algorithm.

Style Transfer

  • References [15, 16, 28, 3, 39, 22] demonstrate successful neural style transfer. These methods generate an image that combines the "content" of one image with the "style" of another. This also indicates that neural features are very effective at generating fine texture and high-frequency detail.

Method

Overall framework


  • The optimal inpainted image \( \tilde{x} \) minimizes a loss function composed of three terms: the holistic content term, the local texture term, and the TV-loss term.
    • The holistic content term constrains the global structure, capturing the global structure and semantics of the image. The content network is trained first, and its output initializes the holistic content term.
    • The local texture term models the local texture statistics of the input image, computed with a VGG-19 network pre-trained on ImageNet.
  • Content constraint: we first train the holistic content network \( f \), whose input is the image with the central rectangular region removed and filled with the mean color, and whose ground truth \( x_g \) is the content of the central rectangle of the original image. Once the content network is trained, its output \( f(x_0) \) serves as the initial content constraint in the joint optimization.
  • Local texture term: ensures that the details of the missing region are similar to those of its surroundings. It is defined through neural patch similarity (neural patches have been applied successfully to capture image style). To optimize this term, the image \( x \) is fed into the pre-trained VGG network (the local texture network), and at a chosen layer, each small neural patch (typically 3 × 3) inside the missing region is encouraged to be similar to its most similar neural patch outside the region. In practice, neural features are computed from the combination of the relu3_1 and relu4_1 layers. We iteratively update \( x \) by minimizing the joint content and texture loss with limited-memory BFGS.
  • Multi-scale scheme: to inpaint a high-resolution image with a large hole, we first downscale the image and use the content network prediction as the reference. Given that reference, we optimize at the low resolution (with both content and texture constraints). The result is then upsampled and used to initialize the optimization at the next, finer scale, as sketched below.
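
To make the coarse-to-fine procedure concrete, here is a minimal sketch (assuming PyTorch; `content_net` and `joint_loss` are hypothetical stand-ins for the trained content network and the loss defined in the next subsection, so this illustrates the procedure rather than reproducing the authors' implementation):

```python
import torch
import torch.nn.functional as F

def multiscale_inpaint(pyramid, masks, content_net, joint_loss, n_iter=100):
    # pyramid / masks: lists ordered coarsest first; masks are 1 in the hole.
    # The content network prediction initializes the coarsest scale.
    reference = content_net(pyramid[0])
    for i, (img, mask) in enumerate(zip(pyramid, masks)):
        # Keep the known pixels; fill the hole from the current reference.
        x = (img * (1 - mask) + reference * mask).detach().requires_grad_(True)
        opt = torch.optim.LBFGS([x], max_iter=n_iter)

        def closure():
            opt.zero_grad()
            loss = joint_loss(x, reference, mask)
            loss.backward()
            return loss

        opt.step(closure)
        if i + 1 < len(pyramid):
            # The upsampled result becomes the content reference
            # for the next, finer scale.
            reference = F.interpolate(x.detach(), scale_factor=2,
                                      mode='bilinear', align_corners=False)
    return x.detach()
```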

The Joint Loss Function

  • The input image is \( x_0 \) and the output image is \( x \).
  • \( R \) denotes the missing region of the output image \( x \), and \( R^\phi \) denotes the corresponding region in the VGG-19 feature map \( \phi(x) \).
  • \( h(\cdot) \) denotes the operation of extracting a rectangular sub-image or sub-feature-map: \( h(x, R) \) is the color content of \( x \) in region \( R \), and \( h(\phi(x), R^\phi) \) is the content of \( \phi(x) \) in region \( R^\phi \).
  • The content constraint network (the content network) is denoted \( f \), and the texture constraint network (the texture network) is denoted \( t \).
  • With scale index \( i = 1, 2, \ldots, N \) (where \( N \) is the number of scales), the optimal reconstruction (hole-filling) result \( \tilde{x} \) is obtained by solving the following minimization problem:
    • \( \tilde{x}_i = \arg\min_{x} \; E_c\big(h(x, R),\, h(\tilde{x}_{i-1}, R)\big) + \alpha\, E_t\big(\phi_t(x),\, R^\phi\big) + \beta\, \Upsilon(x) \)
    • where \( h(\tilde{x}_0, R) = f(x_0) \) (the content network prediction provides the reference at the coarsest scale), \( \phi_t(x) \) denotes an intermediate-layer feature map (or combination of feature maps) of the local texture network \( t \), and \( \alpha \) is a weight reflecting the relative importance of the two terms. Both \( \alpha \) and \( \beta \) are set to 5e-6 to balance the magnitudes of the losses.
    • Explanation of the three terms of the loss function: \( E_c \), \( E_t \), and \( \Upsilon \).
      • \( E_c \) models the holistic content constraint: it penalizes the \( \ell_2 \) difference between the optimization result and the previous prediction (the content network output, or the optimization result from the coarser scale).

        • \( E_c\big(h(x, R), h(\tilde{x}_{i-1}, R)\big) = \big\| h(x, R) - h(\tilde{x}_{i-1}, R) \big\|_2^2 \)
      • \( E_t \) models the local texture constraint: it penalizes differences in texture appearance between the inside and the outside of the missing region.
        • First, a feature layer (or combination of layers) of network \( t \) is selected and its feature map \( \phi_t \) extracted. For each local query patch \( P \) of size \( s \times s \times c \) in the missing region \( R^\phi \), we find the most similar patch outside the missing region, and compute the loss as the average distance between each query patch and its nearest neighbor.
        • \( E_t\big(\phi_t(x), R^\phi\big) = \frac{1}{|R^\phi|} \sum_{i \in R^\phi} \big\| h\big(\phi_t(x), P_i\big) - h\big(\phi_t(x), P_{nn(i)}\big) \big\|_2^2 \)
        • \( |R^\phi| \) is the number of patches sampled in region \( R^\phi \), \( P_i \) is the local neural patch centered at position \( i \), and \( nn(i) \) is computed as:
          • \( nn(i) = \arg\min_{j \in N(i),\, j \notin R^\phi} \big\| h\big(\phi_t(x), P_i\big) - h\big(\phi_t(x), P_j\big) \big\|_2^2 \)
          • \( N(i) \) is the set of patch locations adjacent to \( i \); the condition \( j \notin R^\phi \) restricts the search to patches outside the hole.
      • The TV loss \( \Upsilon(x) \) encourages the result to be smooth:

        • \( \Upsilon(x) = \sum_{i,j} \Big( (x_{i,j+1} - x_{i,j})^2 + (x_{i+1,j} - x_{i,j})^2 \Big) \)
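
To make the three terms concrete, here is a minimal sketch in PyTorch (an illustration, not the authors' implementation: `texture_net` is assumed to return a single feature map such as relu3_1, the mask is a 0/1 tensor of shape (1, 1, H, W), and the nearest-neighbor search is done by brute force over all outside patches rather than over a local window \( N(i) \)):

```python
import torch
import torch.nn.functional as F

def content_term(x, reference, mask):
    # E_c: squared l2 difference to the reference inside the hole R.
    return ((x - reference) * mask).pow(2).sum()

def texture_term(phi, hole_mask, patch=3):
    # E_t: mean squared distance from each neural patch centered inside
    # R^phi to its nearest-neighbour patch centered outside it.
    cols = F.unfold(phi, patch, padding=patch // 2)  # (1, C*p*p, H*W)
    cols = cols.squeeze(0).t()                       # one row per patch
    inside = hole_mask.flatten() > 0.5               # patch-center test
    query, keys = cols[inside], cols[~inside].detach()
    nn_dist = torch.cdist(query, keys).min(dim=1).values
    return nn_dist.pow(2).mean()

def tv_term(x):
    # Upsilon(x): squared differences of horizontally and vertically
    # neighbouring pixels, which smooths the result.
    return ((x[..., :, 1:] - x[..., :, :-1]).pow(2).sum()
            + (x[..., 1:, :] - x[..., :-1, :]).pow(2).sum())

def joint_loss(x, reference, mask, texture_net, alpha=5e-6, beta=5e-6):
    phi = texture_net(x)  # phi_t(x): intermediate feature map
    feat_mask = F.interpolate(mask, size=phi.shape[-2:], mode='nearest')
    return (content_term(x, reference, mask)
            + alpha * texture_term(phi, feat_mask)
            + beta * tv_term(x))
```

Detaching the outside patches stops gradients flowing into them, so the optimization only pulls the hole content toward its nearest neighbors.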

The Content Network

  • A straightforward way to learn the initial content prediction network is to train a regression network \( f \): given an input image \( x \) (with the unknown region), the response \( f(x) \) should approximate the ground truth \( x_g \) in the region \( R \).
  • We experiment with the \( \ell_2 \) loss and the adversarial loss.
  • For each training image, the \( \ell_2 \) loss is defined as \( L_2(x) = \big\| f(x) - h(x_g, R) \big\|_2^2 \).
  • The adversarial loss is defined as \( L_{adv} = \max_{D} \, \mathbb{E}\big[ \log D(h(x_g, R)) + \log\big(1 - D(f(x))\big) \big] \), where \( D \) is the adversarial discriminator.
  • As in the Context Encoder, we combine the \( \ell_2 \) loss and the adversarial loss: \( L = \lambda L_2 + (1 - \lambda) L_{adv} \)
    • with \( \lambda = 0.999 \).
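
A minimal sketch of the combined objective for training the content network (assuming PyTorch; `f` is the content network and `D` a discriminator that outputs a probability, both hypothetical stand-ins; the adversarial term is written in the non-saturating generator form rather than the min-max form above):

```python
import torch
import torch.nn.functional as F

def content_net_loss(f, D, x0, gt_region, lam=0.999):
    # L = lambda * L2 + (1 - lambda) * L_adv, with lambda = 0.999.
    pred = f(x0)                       # f(x): predicted hole content
    l2 = F.mse_loss(pred, gt_region)   # ||f(x) - h(x_g, R)||^2
    d_pred = D(pred)
    # Generator wants D to judge its prediction as real.
    adv = F.binary_cross_entropy(d_pred, torch.ones_like(d_pred))
    return lam * l2 + (1.0 - lam) * adv
```

In practice \( D \) is updated in alternation with \( f \) on real versus predicted hole contents, as in standard GAN training.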

The Texture Network

  • We use a VGG-19 network pre-trained for ImageNet classification as the texture network, and compute the local texture term from the relu3_1 and relu4_1 layers. Using the two layers together gives better results than using a single layer.
  • Why VGG-19: having been trained for semantic classification, its intermediate-layer features are strongly invariant to texture distortions, which helps infer a more accurate reconstruction of the missing content.
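
A minimal sketch of extracting those two layers with torchvision's pre-trained VGG-19 (the layer indices 11 and 20 for relu3_1 and relu4_1 are an assumption based on `torchvision.models.vgg19().features` and are worth double-checking against your torchvision version):

```python
import torch
from torchvision import models

class TextureNet(torch.nn.Module):
    # Exposes the relu3_1 / relu4_1 feature maps of a frozen VGG-19.
    LAYERS = {11: 'relu3_1', 20: 'relu4_1'}

    def __init__(self):
        super().__init__()
        vgg = models.vgg19(weights='IMAGENET1K_V1').features
        self.layers = torch.nn.ModuleList(list(vgg[:21]))  # up to relu4_1
        for p in self.parameters():
            p.requires_grad_(False)  # frozen: gradients flow only into x

    def forward(self, x):
        # x: ImageNet-normalized image tensor of shape (1, 3, H, W).
        out = {}
        for i, layer in enumerate(self.layers):
            x = layer(x)
            if i in self.LAYERS:
                out[self.LAYERS[i]] = x
        return out
```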

Experiments

Visual and quantitative evaluation. We first introduce the datasets, then compare with other methods to demonstrate the effectiveness of this method for high-resolution image inpainting. Finally, we present a real-world application in which distractors are removed from photographs.

  • Datasets: Paris StreetView and ImageNet (labels are not used).
    • Paris StreetView: 14,900 training images and 100 test images.
    • ImageNet: 1.26 million training images, plus 200 images randomly selected from the validation set.
  • Experimental settings: at low resolution (128 × 128), we first compare our approach with several baseline methods.
    • First, we compare with the results of the Context Encoder trained with the \( \ell_2 \) loss.
    • Second, we compare with the best results of the Context Encoder using the adversarial loss, the state of the art in deep-learning-based image inpainting.
    • Finally, we compare with the results of Content-Aware Fill in Adobe Photoshop, which is based on the PatchMatch algorithm. These comparisons demonstrate the effectiveness of the proposed joint optimization framework.
      • Comparison with the baselines shows the overall effectiveness of the joint optimization algorithm and the role of the texture network within it; we further analyze the separate roles of the content network and the texture network in the joint optimization.
      • Finally, we present high-resolution inpainting results and compare with Content-Aware Fill and the Context Encoder (ℓ2 and adversarial loss). Note that for the Context Encoder, the high-resolution result is obtained by directly upsampling its low-resolution output. Our method shows a significant improvement in visual quality.
  • Quantitative Comparisons
    • At low resolution (128 × 128) on the Paris StreetView dataset, we compare our method against the baselines. The results in Table 1 show that our approach achieves the highest numerical performance. We attribute this to the nature of our approach: it can infer the correct structure of the image where Content-Aware Fill fails and, compared with the Context Encoder, it synthesizes better image detail (Fig. 4). Moreover, given that the goal of inpainting is to generate realistic content rather than to reproduce the original image exactly, quantitative metrics may not be the most meaningful measure.
  • The effects of content and texture networks
    • We performed a study that removes the content constraint term and keeps only the texture term in the joint optimization. As Fig. 8 shows, without the content term to guide the optimization, the structure of the inpainted result is completely wrong. We also adjusted the relative weight between the content and texture terms. With a heavier content weight, the result conforms more closely to the initial content network prediction but may lack high-frequency detail; with a heavier texture weight, the result is sharper, but the overall structure of the image is not guaranteed to be correct (Fig. 6).
  • The effect of the adversarial loss
    • We analyzed the effect of using the adversarial loss when training the content network. One might think that even without the adversarial loss the content network could still predict the image structure, and the joint optimization would fix the texture afterwards. However, we found that the quality of the content network initialization is crucial to the final result: when the initial prediction is blurry (using only the ℓ2 loss), the final result is noticeably more blurred than when the content network is trained with both the ℓ2 loss and the adversarial loss (Fig. 7).
  • High-Resolution Image Inpainting
    • We show high-resolution (512 × 512) inpainting results in Fig. 5 and Fig. 10, compared with Content-Aware Fill and the Context Encoder (\( \ell_2 \) loss + adversarial loss). Since the Context Encoder only applies to 128 × 128 images, when the input is larger we upsample its 128 × 128 output directly to 512 × 512 with bilinear interpolation. In most results, our multi-scale iterative method combines the advantages of the other approaches, producing results with a coherent global structure as well as high-frequency detail. As shown, a significant advantage over Content-Aware Fill is that our method can generate new texture, because it does not directly reuse existing patches for the repair. A disadvantage is speed: with our current implementation, the algorithm takes about 1 minute on a Titan X GPU to fill a 256 × 256 hole in a 512 × 512 image, much slower than Content-Aware Fill.
  • Real-World Distractor Removal Scenario
    • Finally, our algorithm easily extends to missing regions of arbitrary shape. We first cover the arbitrary missing region with its bounding rectangle and fill it with the mean pixel value. After appropriate cropping and padding so that the rectangle is centered, the image is used as the input to the content network. In the joint optimization, the content constraint is initialized by the content network's output at the arbitrary missing region, while the texture constraint is based on the region outside the missing part (a minimal preprocessing sketch follows below). Fig. 11 shows several examples and a comparison with Content-Aware Fill (note that the Context Encoder cannot explicitly handle arbitrary missing regions, so we do not compare with it here).
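
A minimal sketch of that preprocessing step (assuming PyTorch tensors; `prepare_input` is a hypothetical helper, and the crop/pad logic that centers the rectangle is omitted for brevity):

```python
import torch

def prepare_input(image, mask):
    # Cover an arbitrary-shaped hole with its bounding rectangle and
    # fill that rectangle with the mean color of the known pixels.
    # image: (3, H, W) float tensor; mask: (H, W) bool, True in the hole.
    ys, xs = torch.nonzero(mask, as_tuple=True)
    top, bottom = ys.min().item(), ys.max().item() + 1
    left, right = xs.min().item(), xs.max().item() + 1
    out = image.clone()
    mean = image[:, ~mask].mean(dim=1)  # per-channel mean of known pixels
    out[:, top:bottom, left:right] = mean[:, None, None]
    return out, (top, bottom, left, right)
```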

Conclusion

We have presented the latest advances in using neural patch synthesis for semantic inpainting. We find that the texture network is very powerful at producing high-frequency detail, while the content network provides a strong prior on the global structure and semantics. This could potentially be useful for other applications such as denoising, super-resolution, retargeting, and view/temporal interpolation. When the scene is complex, our method can introduce discontinuities and artifacts (Fig. 9). In addition, speed remains a bottleneck of our algorithm. We aim to address these issues in future work.


Origin www.cnblogs.com/wenshinlee/p/12444785.html