[Image Restoration] AOT-GAN "Aggregated Contextual Transformations for High-Resolution Image Inpainting"

Contributions

  1. Propose aggregated contextual transformations (AOT) for high-resolution image inpainting, which allow capturing informative long-range context and rich patterns of interest for contextual reasoning.
  2. A new mask prediction task is designed to train a discriminator for image inpainting, so that the discriminator can distinguish real patches from synthetic patches, thus helping the generator to synthesize fine-grained textures.

Model structure

Overall framework

[Figure: overall framework of AOT-GAN]


AOT block

The generator first encodes the input with several standard convolutional layers, then transforms the features with AOT blocks, and finally decodes with transposed convolutions.

AOT blocks adopt a split-transform-merge strategy in three steps:

(1) Splitting: the AOT block splits the kernel of a standard convolution into multiple sub-kernels, each with fewer output channels;

(2) Transformation: each sub-kernel uses a different dilation rate. Larger dilation rates let a sub-kernel attend to larger regions of the input image, while sub-kernels with smaller dilation rates focus on local patterns within smaller receptive fields.

(3) Aggregation: Context transformations from different receptive fields are finally aggregated by concatenation and standard convolution for feature fusion.

Such a design enables the AOT block to predict each output pixel of the image through different views.
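A minimal PyTorch sketch of this split-transform-merge idea is given below; it is not the authors' exact implementation, and the channel count and dilation rates (1, 2, 4, 8) are illustrative assumptions:

```python
import torch
import torch.nn as nn


class AOTBlockSketch(nn.Module):
    """Split-transform-merge over parallel dilated convolutions (illustrative)."""

    def __init__(self, channels=256, dilations=(1, 2, 4, 8)):
        super().__init__()
        # Splitting: each sub-kernel gets fewer output channels.
        sub_channels = channels // len(dilations)
        # Transformation: each branch uses a different dilation rate,
        # i.e. a different receptive field over the input feature map.
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, sub_channels, kernel_size=3,
                          padding=d, dilation=d),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])
        # Aggregation: concatenate branch outputs and fuse with a standard conv.
        self.fuse = nn.Conv2d(sub_channels * len(dilations), channels,
                              kernel_size=3, padding=1)

    def forward(self, x):
        out = torch.cat([branch(x) for branch in self.branches], dim=1)
        return self.fuse(out)
```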

The formula below replaces the traditional identity residual connection with a gated one. In the aggregation formula, g is a spatially varying gating value. This spatially varying feature aggregation preserves the known features outside the missing region while updating the features inside the missing region as much as possible.

$$x_{3} = x_{1} \times g + x_{2} \times (1 - g)$$
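A minimal sketch of this gated aggregation, where $x_1$ is the block input, $x_2$ the transformed feature, and the gate $g$ is predicted by an extra convolution (the gating layer below is an assumption for illustration):

```python
import torch
import torch.nn as nn


class GatedResidualSketch(nn.Module):
    """x3 = x1 * g + x2 * (1 - g) with a spatially varying gate g in [0, 1]."""

    def __init__(self, channels=256):
        super().__init__()
        # Hypothetical gating layer: predicts one gate value per spatial location.
        self.gate = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, x1, x2):
        # x1: block input (known context), x2: transformed features.
        g = torch.sigmoid(self.gate(x2))
        return x1 * g + x2 * (1.0 - g)
```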


Soft Mask-Guided PatchGAN (SM-PatchGAN)

What problem does it solve?

Most deep inpainting models tend to generate an average of all possible solutions based on a reconstruction loss (L1 Loss), which leads to blurry textures.

The inpainting result is expressed as:
$$z = x \odot (1-m) + G(x \odot (1-m),\, m) \odot m$$

The inpainting result is the superposition of two parts: the intact region of the original image and the generated hole region. Here, $m$ is a binary mask (0 for known pixels, 1 for unknown pixels), i.e., the missing region appears white.
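In code, this composition can be sketched as follows; the `generator` callable and its two-argument signature are assumptions:

```python
import torch


def compose_result(generator, x, m):
    """x: real image (B, 3, H, W); m: binary mask, 1 = missing, 0 = known."""
    masked = x * (1.0 - m)        # x ⊙ (1 - m): keep only the known pixels
    pred = generator(masked, m)   # G(x ⊙ (1 - m), m): hallucinate the hole
    return masked + pred * m      # paste the prediction back into the hole only
```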

Adversarial loss for the discriminator:

$$L_{adv}^{D} = \mathbb{E}_{z \sim p_{z}}\left[\left(D(z) - \sigma(1-m)\right)^{2}\right] + \mathbb{E}_{x \sim p_{\text{data}}}\left[\left(D(x) - 1\right)^{2}\right]$$

where $\sigma$ denotes the composition of downsampling and Gaussian filtering.

Generator's adversarial loss:

$$L_{adv}^{G} = \mathbb{E}_{z \sim p_{z}}\left[(D(z) - 1)^{2} \odot m\right]$$
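A sketch of both LSGAN-style objectives under these definitions; the soft target $\sigma(1-m)$ is built here by Gaussian-filtering the inverted mask and then resizing it to the patch-score resolution, and the kernel size, sigma, and filter-then-resize order are assumptions of this sketch:

```python
import torch
import torch.nn.functional as F
from torchvision.transforms import GaussianBlur

# Soft target sigma(1 - m): Gaussian-filter the inverted full-resolution mask,
# then downsample it to the discriminator's patch resolution. The odd kernel
# size 71 (torchvision requires odd sizes) and sigma=10 are assumptions.
blur = GaussianBlur(kernel_size=71, sigma=10.0)


def soft_target(m, patch_size):
    """m: binary mask (B, 1, H, W), 1 = missing. Returns values in [0, 1]."""
    soft = blur(1.0 - m)
    return F.interpolate(soft, size=patch_size, mode="bilinear",
                         align_corners=False)


def d_loss(D, x, z, m):
    """Discriminator: real patches -> 1, inpainted image -> soft mask target."""
    patch_real = D(x)
    patch_fake = D(z.detach())
    target = soft_target(m, patch_fake.shape[-2:])
    return ((patch_fake - target) ** 2).mean() + ((patch_real - 1.0) ** 2).mean()


def g_loss(D, z, m):
    """Generator: push patch scores toward 1, but only inside the hole region."""
    patch_fake = D(z)
    hole = F.interpolate(m, size=patch_fake.shape[-2:], mode="nearest")
    return ((patch_fake - 1.0) ** 2 * hole).mean()
```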

Discriminator design

A soft patch-level mask is designed for the discriminator.

Comparison of different designs:

PatchGAN’s discriminator classifies all patches in the inpainted image as fake, ignoring the fact that the patches outside the missing regions come from the real image. The proposed SM-PatchGAN instead distinguishes the synthetic patches inside the missing regions from the real patches of the context, which strengthens the discriminator.

HM-PatchGAN uses no Gaussian filter and therefore ignores that the patches around the boundary of the inpainted region may contain both real and synthetic pixels. The proposed SM-PatchGAN introduces a Gaussian filter to address this.


Overall optimization

The optimization objective combines four losses: the $L_1$ loss (reconstruction loss), the style loss, the perceptual loss, and the adversarial loss; a combined code sketch follows the parameter settings below.

(1) The $L_1$ loss ensures pixel-level reconstruction accuracy:

$$L_{rec} = \|x - G(x \odot (1-m),\, m)\|_{1}$$
(2) The perceptual loss minimizes the $L_1$ distance between the activation maps of the inpainted image and the real image:

$$L_{per} = \sum_{i} \frac{\left\|\phi_{i}(x) - \phi_{i}(z)\right\|_{1}}{N_{i}}$$

where $\phi_{i}$ is the activation map from the $i$-th layer of a pretrained network (e.g., VGG19) and $N_{i}$ is the number of elements in $\phi_{i}$.

(3) The style loss is defined as the $L_1$ distance between the Gram matrices of the deep features of the inpainted image and the real image:

$$L_{sty} = \mathbb{E}_{i}\left[\left\|\phi_{i}(x)^{T} \phi_{i}(x) - \phi_{i}(z)^{T} \phi_{i}(z)\right\|_{1}\right]$$
(4) The adversarial loss (the generator term defined above):

$$L_{adv} = \mathbb{E}_{z \sim p_{z}}\left[(D(z) - 1)^{2} \odot m\right]$$

The overall optimization objective is:

$$L = \lambda_{adv} L_{adv}^{G} + \lambda_{rec} L_{rec} + \lambda_{per} L_{per} + \lambda_{sty} L_{sty}$$

Parameter settings: $\lambda_{adv} = 0.01$, $\lambda_{rec} = 1$, $\lambda_{per} = 0.1$, $\lambda_{sty} = 250$.
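A sketch of how the four terms and their weights might be combined; the VGG19 feature taps, the Gram normalization, and the torchvision weights API used below are assumptions rather than the paper's exact choices:

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Frozen VGG19 features for the perceptual and style losses (recent torchvision
# weights API assumed); images are assumed to be already normalized for VGG,
# and the feature taps / Gram normalization below are choices of this sketch.
_vgg = vgg19(weights="IMAGENET1K_V1").features.eval()
for p in _vgg.parameters():
    p.requires_grad_(False)
_TAPS = (1, 6, 11, 20, 29)  # roughly relu1_1 ... relu5_1


def vgg_features(img):
    feats, h = [], img
    for i, layer in enumerate(_vgg):
        h = layer(h)
        if i in _TAPS:
            feats.append(h)
    return feats


def gram(f):
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)  # normalized Gram matrix


def generator_objective(D, x, z, m):
    """L = 0.01*L_adv + 1*L_rec + 0.1*L_per + 250*L_sty (weights from the notes)."""
    l_rec = F.l1_loss(z, x)                                           # reconstruction
    fx, fz = vgg_features(x), vgg_features(z)
    l_per = sum(F.l1_loss(a, b) for a, b in zip(fx, fz))              # perceptual
    l_sty = sum(F.l1_loss(gram(a), gram(b)) for a, b in zip(fx, fz))  # style
    score = D(z)
    hole = F.interpolate(m, size=score.shape[-2:], mode="nearest")
    l_adv = ((score - 1.0) ** 2 * hole).mean()                        # adversarial
    return 0.01 * l_adv + 1.0 * l_rec + 0.1 * l_per + 250.0 * l_sty
```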


Implementation details

The Gaussian filter in SM-PatchGAN uses a 70×70 kernel. To avoid the color-shift problem caused by normalization layers, all normalization layers in the generator are removed.

Training parameter settings:

Each mini-batch randomly samples 8 images with their corresponding masks. The learning rate of both the generator and the discriminator is $10^{-4}$, with an optimizer using $\beta_{1} = 0$ and $\beta_{2} = 0.9$. VGG19 pre-trained on ImageNet serves as the feature network for computing the style and perceptual losses.
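A minimal sketch of this training setup; Adam is assumed from the $\beta_1$/$\beta_2$ notation, and the placeholder networks only stand in for the real generator and discriminator:

```python
import torch
import torch.nn as nn

# Tiny placeholder networks standing in for the AOT-GAN generator/discriminator.
netG = nn.Conv2d(4, 3, kernel_size=3, padding=1)  # (masked image + mask) -> image
netD = nn.Conv2d(3, 1, kernel_size=3, padding=1)  # image -> patch scores

# Learning rate 1e-4 for both networks; Adam is assumed from the beta notation.
optG = torch.optim.Adam(netG.parameters(), lr=1e-4, betas=(0.0, 0.9))
optD = torch.optim.Adam(netD.parameters(), lr=1e-4, betas=(0.0, 0.9))

batch_size = 8  # images with their corresponding random masks per mini-batch
```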


Experiments

Datasets used

Places2, CelebA-HQ, QMUL-OpenLogo

Mask dataset

The masks come from the dataset released with the paper "Image Inpainting for Irregular Holes Using Partial Convolutions", which is also used by most image inpainting works.

Model benchmarks for comparison

(1) CA: Generative image inpainting with contextual attention. (2018)

(2) PEN-Net: Learning pyramid-context encoder network for high-quality image inpainting. (2019)

(3) PConv: Image inpainting for irregular holes using partial convolutions. (2018)

(4) EdgeConnect: Generative image inpainting with adversarial edge learning. (2019)

(5) GatedConv: Free-form image inpainting with gated convolution. (2019)

(6) HiFill: Contextual residual aggregation for ultra high-resolution image inpainting. (2020)

(7) MNPS: High-resolution image inpainting using multi-scale neural patch synthesis. (2017)

The above seven models are all classic models in the field of Image Inpainting.

Evaluation Criteria

$L_1$ error, PSNR, SSIM, FID
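A small sketch of how the first three metrics can be computed with NumPy and scikit-image (a recent scikit-image is assumed for `channel_axis`); FID needs a pretrained Inception network and is usually computed with a dedicated tool, so it is omitted here:

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity


def evaluate(pred, gt):
    """pred, gt: float arrays in [0, 1] with shape (H, W, 3)."""
    l1 = np.mean(np.abs(pred - gt))                           # L1 error
    psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)  # PSNR (dB)
    ssim = structural_similarity(gt, pred, channel_axis=-1,   # SSIM
                                 data_range=1.0)
    return l1, psnr, ssim
```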

The paper then presents qualitative experiments, quantitative experiments, and a user study. The reported results are better than the baselines, so they are not summarized here; see the paper for details.

Ablation experiment

The ablation verifies the effectiveness of three components of AOT-GAN: the gated contextual transformations, the gated residual connections, and the SM-PatchGAN discriminator.

[Figure: ablation study results]


Conclusion

Limitations

(1) The number of branches and the dilation rates of the AOT block are set empirically. When the image size changes, these parameters may need to be re-tuned; they cannot adapt automatically.

(2) In practical applications such as logo removal, it is difficult to segment the logo region automatically.

Source: blog.csdn.net/hshudoudou/article/details/127365741