Partial convolution && Gated convolution

group meeting discussion thread


1. Image Restoration

Image inpainting, as the name suggests, repairs the damaged parts of an image. It is an image-editing technique that can be applied to object removal, old-photo restoration, image completion (e.g., seismic data interpolation), and so on.


2. Partial convolution

Link to the paper: Image Inpainting for Irregular Holes Using Partial Convolutions (ECCV 2018)

Earlier deep-learning image-completion methods used an ordinary end-to-end CNN: the damaged image serves as input and the complete image as the label for learning. When ordinary (vanilla) convolution acts on the damaged region of the image, most of the computation is wasted, because the pixels there carry no real content (they are just placeholder values); at the same time, the convolution kernel cannot distinguish damaged from undamaged regions and is insensitive to the information difference between the two parts.
PConv adds a binary mask to the convolution operation, which greatly improves computational efficiency, and it distinguishes pixels in the damaged region from those in the undamaged region, improving the layer's sensitivity to that difference.

Why is traditional convolution inappropriate for (free-form) image inpainting tasks?
Traditional convolution treats every pixel as a valid value. That is appropriate for classification and detection tasks, where all pixels of the input image are valid and ordinary convolution extracts local features in a sliding-window manner. For the inpainting task, however, the image to be repaired contains holes (i.e., the pixels inside them are invalid), and the content inside and outside the holes must be treated differently.

Partial convolutional layer:
$$x' = \begin{cases} \mathbf{W}^{T}(\mathbf{X} \odot \mathbf{M})\,\dfrac{\text{sum}(\mathbf{1})}{\text{sum}(\mathbf{M})} + b, & \text{if } \text{sum}(\mathbf{M}) > 0 \\ 0, & \text{otherwise} \end{cases}$$
where $\mathbf{X}$ contains the feature values (pixel values) in the current convolution (sliding) window, $\mathbf{M}$ is the corresponding binary mask, $\mathbf{W}$ is the convolution kernel weight, $b$ is the bias, and $\odot$ denotes element-wise multiplication. In the first layer of PConv, a value of 1 in $\mathbf{M}$ marks an undamaged region and 0 a damaged region.

Mask update rule:
$$m' = \begin{cases} 1, & \text{if } \text{sum}(\mathbf{M}) > 0 \\ 0, & \text{otherwise} \end{cases}$$
That is, as long as at least one element of $\mathbf{M}$ inside the window is 1, the updated mask value $m'$ is 1; otherwise it is 0.
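These two equations map directly onto two convolutions: an ordinary convolution over the masked input, plus a fixed all-ones convolution that counts the valid pixels in each window. Below is a minimal PyTorch sketch of the idea, not the authors' reference implementation; the layer hyperparameters and the fixed counting kernel are illustrative assumptions.

import torch
import torch.nn as nn
import torch.nn.functional as F

class PartialConv2d(nn.Module):
    # Minimal sketch of a partial convolution layer (square kernel assumed).
    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size,
                              stride=stride, padding=padding, bias=True)
        # Fixed all-ones kernel that counts valid (mask == 1) pixels per window.
        self.register_buffer('weight_mask', torch.ones(1, 1, kernel_size, kernel_size))
        self.window_size = float(kernel_size * kernel_size)  # sum(1)

    def forward(self, x, mask):
        # x: (B, C, H, W) features; mask: (B, 1, H, W), 1 = valid, 0 = hole
        with torch.no_grad():
            valid_count = F.conv2d(mask, self.weight_mask,
                                   stride=self.conv.stride,
                                   padding=self.conv.padding)            # sum(M) per window
            new_mask = torch.clamp(valid_count, 0, 1)                    # m' = 1 if sum(M) > 0
            ratio = self.window_size / (valid_count + 1e-8) * new_mask   # sum(1)/sum(M)
        raw = self.conv(x * mask)                                        # W^T (X ⊙ M) + b
        bias = self.conv.bias.view(1, -1, 1, 1)
        out = (raw - bias) * ratio + bias                                # rescale, keep bias
        return out * new_mask, new_mask                                  # x' = 0 where sum(M) = 0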

2.1 Loss

Let the image with holes be $\mathbf{I}_{in}$, the initial binary mask $\mathbf{M}$, the network prediction $\mathbf{I}_{out}$, and the ground-truth image $\mathbf{I}_{gt}$.
Per-pixel losses:
$$\mathcal{L}_{hole} = \frac{1}{N_{\mathbf{I}_{gt}}} \lVert (1-\mathbf{M}) \odot (\mathbf{I}_{out} - \mathbf{I}_{gt}) \rVert_1$$
$$\mathcal{L}_{valid} = \frac{1}{N_{\mathbf{I}_{gt}}} \lVert \mathbf{M} \odot (\mathbf{I}_{out} - \mathbf{I}_{gt}) \rVert_1$$
where $N_{\mathbf{I}_{gt}}$ is the number of elements in $\mathbf{I}_{gt}$.
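In code, both terms are plain masked L1 distances. A minimal sketch, assuming $N_{\mathbf{I}_{gt}}$ counts all elements of $\mathbf{I}_{gt}$ and the mask is 1 on valid pixels (the function name is illustrative):

import torch

def per_pixel_losses(i_out, i_gt, mask):
    # mask: 1 on valid (undamaged) pixels, 0 inside holes
    n = i_gt.numel()                                         # N_{I_gt}
    l_hole = torch.abs((1 - mask) * (i_out - i_gt)).sum() / n
    l_valid = torch.abs(mask * (i_out - i_gt)).sum() / n
    return l_hole, l_valid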

$\mathbf{I}_{comp}$ denotes the raw output $\mathbf{I}_{out}$, but with the non-hole pixels set to the ground-truth image.
$\Psi^{\mathbf{I}^*}_p$ denotes the activation map of the $p$-th layer for image $\mathbf{I}^*$ (a projection into a higher-level feature space by an ImageNet-pretrained VGG-16).
Perceptual loss:
$$\mathcal{L}_{perceptual} = \sum_{p=0}^{P-1} \frac{\lVert \Psi_p^{\mathbf{I}_{out}} - \Psi_p^{\mathbf{I}_{gt}} \rVert_1}{N_{\Psi_p^{\mathbf{I}_{gt}}}} + \sum_{p=0}^{P-1} \frac{\lVert \Psi_p^{\mathbf{I}_{comp}} - \Psi_p^{\mathbf{I}_{gt}} \rVert_1}{N_{\Psi_p^{\mathbf{I}_{gt}}}}$$

Assuming the high-level feature $\Psi^{\mathbf{I}^*}_p$ has shape $(H_p W_p) \times C_p$, a $C_p \times C_p$ Gram matrix can be formed; $K_p$ is the normalization factor $\frac{1}{C_p H_p W_p}$ for the $p$-th layer.

Style loss terms:
$$\mathcal{L}_{style_{out}} = \sum_{p=0}^{P-1} \frac{1}{C_p C_p} \left\lVert K_p \left( (\Psi_p^{\mathbf{I}_{out}})^{T} \Psi_p^{\mathbf{I}_{out}} - (\Psi_p^{\mathbf{I}_{gt}})^{T} \Psi_p^{\mathbf{I}_{gt}} \right) \right\rVert_1$$
$$\mathcal{L}_{style_{comp}} = \sum_{p=0}^{P-1} \frac{1}{C_p C_p} \left\lVert K_p \left( (\Psi_p^{\mathbf{I}_{comp}})^{T} \Psi_p^{\mathbf{I}_{comp}} - (\Psi_p^{\mathbf{I}_{gt}})^{T} \Psi_p^{\mathbf{I}_{gt}} \right) \right\rVert_1$$
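A minimal sketch of the Gram-matrix computation behind these terms, assuming VGG features in the usual (B, C_p, H_p, W_p) layout (function names are illustrative):

import torch

def gram_matrix(feat):
    # feat: (B, C_p, H_p, W_p) VGG activation map
    b, c, h, w = feat.shape
    psi = feat.view(b, c, h * w)                  # (B, C_p, H_p*W_p)
    gram = torch.bmm(psi, psi.transpose(1, 2))    # (B, C_p, C_p)
    return gram / (c * h * w)                     # apply K_p = 1/(C_p H_p W_p)

def style_term(feat_pred, feat_gt):
    # L1 distance between Gram matrices, normalized by C_p * C_p
    c = feat_pred.shape[1]
    return torch.abs(gram_matrix(feat_pred) - gram_matrix(feat_gt)).sum() / (c * c)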
Total variation loss:
$$\mathcal{L}_{tv} = \sum_{(i,j)\in R,\,(i,j+1)\in R} \frac{\lVert \mathbf{I}_{comp}^{i,j+1} - \mathbf{I}_{comp}^{i,j} \rVert_1}{N_{\mathbf{I}_{comp}}} + \sum_{(i,j)\in R,\,(i+1,j)\in R} \frac{\lVert \mathbf{I}_{comp}^{i+1,j} - \mathbf{I}_{comp}^{i,j} \rVert_1}{N_{\mathbf{I}_{comp}}}$$
where $R$ is the region of a 1-pixel dilation of the hole region.
Total loss:
$$\mathcal{L}_{total} = \mathcal{L}_{valid} + 6\mathcal{L}_{hole} + 0.05\mathcal{L}_{perceptual} + 120(\mathcal{L}_{style_{out}} + \mathcal{L}_{style_{comp}}) + 0.1\mathcal{L}_{tv}$$
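For reference, the weighted combination in code, assuming the individual terms have already been computed (the variable names are hypothetical):

# weighted sum of the precomputed loss terms (names are hypothetical)
l_total = (l_valid + 6.0 * l_hole + 0.05 * l_perceptual
           + 120.0 * (l_style_out + l_style_comp) + 0.1 * l_tv)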

2.2 Its application in super-resolution tasks

The input to the network is constructed from low-resolution images by offsetting pixels and inserting holes. The details are as follows:
Let the low-resolution image be $I$, with size $H \times W$, and let the scale factor be $K$.
Let the network input image be $I'$, with size $(KH) \times (KW)$. For each pixel $(x, y)$ in $I$, place it at position $(Kx + \lfloor K/2 \rfloor,\; Ky + \lfloor K/2 \rfloor)$ in $I'$, and set the value of $\mathbf{M}$ at that position to 1; all other positions are treated as holes.
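A minimal NumPy sketch of this input construction (the function name and the HWC channel layout are assumptions):

import numpy as np

def make_sr_input(lr, K):
    # lr: (H, W, C) low-resolution image; K: integer scale factor
    H, W, C = lr.shape
    inp = np.zeros((K * H, K * W, C), dtype=lr.dtype)
    mask = np.zeros((K * H, K * W, 1), dtype=np.float32)
    off = K // 2                       # the ⌊K/2⌋ offset
    inp[off::K, off::K, :] = lr        # place each LR pixel at (K*x+off, K*y+off)
    mask[off::K, off::K, :] = 1.0      # mask = 1 where a real pixel was placed
    return inp, mask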

3. Gated convolution

Paper link: Free-Form Image Inpainting with Gated Convolution (ICCV 2019)

What are the disadvantages of partial convolution?
1. When updating the mask, it heuristically classifies spatial locations as valid or invalid. No matter how many valid pixels the previous layer's filter window covers, the mask of the next layer is set to 1 (for example, 1 valid pixel and 9 valid pixels update the current mask in exactly the same way, as the small check after this list shows), which seems unreasonable.
2. All channels in each layer share the same mask, which limits flexibility. Essentially, partial convolution can be viewed as non-learnable hard gating of single-channel features.
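A small check of the first point, mimicking the mask-update rule with an all-ones counting kernel (a sketch; the 3×3 window size is an arbitrary choice): a window containing a single valid pixel and a fully valid window yield the same updated mask.

import torch
import torch.nn.functional as F

ones = torch.ones(1, 1, 3, 3)                       # counting kernel, gives sum(M) per window
m1 = torch.zeros(1, 1, 3, 3); m1[0, 0, 1, 1] = 1    # window with 1 valid pixel
m9 = torch.ones(1, 1, 3, 3)                         # window with 9 valid pixels
for m in (m1, m9):
    updated = torch.clamp(F.conv2d(m, ones), 0, 1)  # m' = 1 if sum(M) > 0
    print(updated.item())                           # both print 1.0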


Gated convolution layer:
$$Gating_{y,x} = \sum\sum W_g \cdot I$$
$$Feature_{y,x} = \sum\sum W_f \cdot I$$
$$O_{y,x} = \phi(Feature_{y,x}) \odot \sigma(Gating_{y,x})$$
where $W_g$ and $W_f$ are the weights of the corresponding convolution kernels, $I$ is the input feature map, $\phi$ can be any activation function (such as ReLU), and $\sigma$ is the sigmoid function.

Gated convolution lets the network learn a dynamic feature-selection mechanism for each channel and each spatial location. Interestingly, visualizations of the intermediate gating values show that it selects features not only according to background, mask, and sketch, but also takes the semantic segmentation in some channels into account. Even in deep layers, gated convolution learns to highlight mask regions and sketch information in separate channels, so as to generate better inpainting results.

3.1 Loss

Generator:
$$\mathcal{L}_{G} = -\mathbb{E}_{z\sim\mathbb{P}_z(z)}\left[D^{sn}(G(z))\right]$$
Discriminator:
$$\mathcal{L}_{D^{sn}} = \mathbb{E}_{x\sim\mathbb{P}_{data}(x)}\left[\mathrm{ReLU}\left(1 - D^{sn}(x)\right)\right] + \mathbb{E}_{z\sim\mathbb{P}_z(z)}\left[\mathrm{ReLU}\left(1 + D^{sn}(G(z))\right)\right]$$
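A minimal sketch of these two hinge losses, assuming d_real = $D^{sn}(x)$ and d_fake = $D^{sn}(G(z))$ are precomputed discriminator outputs (the names are hypothetical):

import torch.nn.functional as F

def generator_loss(d_fake):
    # L_G = -E[D^sn(G(z))]
    return -d_fake.mean()

def discriminator_loss(d_real, d_fake):
    # L_D = E[ReLU(1 - D^sn(x))] + E[ReLU(1 + D^sn(G(z)))]
    return F.relu(1.0 - d_real).mean() + F.relu(1.0 + d_fake).mean()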

3.2 Gconv implementation code

import torch
import torch.nn as nn


class GatedConv2d(nn.Module):
    """
    Gated convolution layer with activation (default activation: LeakyReLU).
    Params: same as Conv2d.
    Input: the feature map "I" from the previous layer.
    Output: phi(f(I)) * sigmoid(g(I))
    """

    def __init__(self, in_channels, out_channels, kernel_size, stride=1, padding=0, dilation=1, groups=1, bias=True,
                 batch_norm=True, activation=torch.nn.LeakyReLU(0.2, inplace=True)):
        super(GatedConv2d, self).__init__()
        self.batch_norm = batch_norm
        self.activation = activation
        # feature branch f(I)
        self.conv2d = torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding, dilation, groups, bias)
        # gating branch g(I): same configuration, separate weights
        self.mask_conv2d = torch.nn.Conv2d(in_channels, out_channels, kernel_size, stride, padding, dilation, groups,
                                           bias)
        self.batch_norm2d = torch.nn.BatchNorm2d(out_channels)
        self.sigmoid = torch.nn.Sigmoid()

        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.kaiming_normal_(m.weight)

    def gated(self, mask):
        # soft gating values in (0, 1)
        return self.sigmoid(mask)

    def forward(self, input):
        x = self.conv2d(input)          # features f(I)
        mask = self.mask_conv2d(input)  # gating logits g(I)
        if self.activation is not None:
            x = self.activation(x) * self.gated(mask)
        else:
            x = x * self.gated(mask)
        if self.batch_norm:
            return self.batch_norm2d(x)
        else:
            return x
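A quick usage example for the layer above (the shapes are illustrative):

# gated conv on a batch of RGB images; padding=1 keeps the spatial size
layer = GatedConv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1)
x = torch.randn(4, 3, 256, 256)
y = layer(x)    # shape: (4, 32, 256, 256)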
          

Gated convolution and the attention mechanism:
Similarities:
1. Contextual information capture: both aim to capture dependencies and contextual information in the input data by introducing mechanisms that improve the model's ability to understand the data.
2. Adaptivity: both gated convolution and the attention mechanism are adaptive; they automatically learn weights or importance for different parts of the input and apply them as weighted processing.
Differences:
1. Application domain: gated convolution is mainly used for image-processing tasks in convolutional neural networks (CNNs), such as the free-form inpainting discussed here. The attention mechanism is widely used in recurrent neural networks (RNNs), Transformers, and similar models for sequence data, language modeling, machine translation, and other tasks.
2. Mechanism: gated convolution weights the input by introducing a gating branch whose (sigmoid) output modulates the features produced by the convolution. The attention mechanism weights different parts of the input by computing attention weights according to the relevance between the input and its context.
3. Locality: gated convolution attends to local regions of the input, weighting them through the convolution operation, while the attention mechanism attends more to global context and dynamically adjusts its weights according to the relationship between the input and the context.


References:
https://zhuanlan.zhihu.com/p/519446359
https://www.cnblogs.com/wenshinlee/p/12591947.html
https://blog.csdn.net/weixin_43135178/article/details/123229497
https://cloud.tencent.com/developer/article/1759006
https://blog.csdn.net/yexiaogu1104/article/details/88293200
