DOA-GAN: Dual-Order Attentive Generative Adversarial Network for Image Copy-move Forgery Detection and Localization

Source: 2020


Abstract: copy-move forgery detection using a GAN with a dual-order attention model

  1. Generator:

    The first-order attention captures copy-move location information

    The second-order attention finds discriminative features for patch co-occurrence

  2. Attention maps are extracted from the affine matrix and used to fuse location-aware and co-occurrence features

  3. Discriminator

    Ensures more accurate localization results


Datasets: the USC-ISI CMFD dataset [46], the CASIA CMFD dataset [46], the CoMoFoD dataset [41], and the COCO dataset [21].

lab environment

Metrics (copy-move): precision, recall, F1 score

Splicing: IoU, F1, MCC

introduction

Patch-based methods [8, 32, 17]

Keypoint-based approaches [49, 33]

Methods based on irregular regions [19, 36]

Deep learning methods [44, 22, 46]

This paper proposes a dual-order attentive Generative Adversarial Network (DOA-GAN).

Given an input image, an affine matrix is computed from the per-pixel feature vectors. The dual-order attention module then produces a 1st-order attention map \(A_1\), which explores the location information of the tampering, and a 2nd-order attention map \(A_2\), which captures more precise patch interdependence. The final feature representation is computed from the two attention maps and fed into the detection branch, which outputs a detection confidence score, and into the localization branch, which outputs a predicted mask marking the source and target regions.

The discriminator judges whether the predicted result is consistent with the ground truth.

The affine matrix contains second-order feature statistics, which motivates exploring \(A_2\) to distinguish tampered regions from incidental texture similarity in the target. High off-diagonal values indicate a strong copy-move spatial relationship between patches, which motivates exploring \(A_1\) to focus on the tampered regions. The affine matrix is refined and regularized, and the top-k values of each column form a 3D tensor with k channels. This tensor is fed into a CNN to produce the 1st-order attention map \(A_1\), which attends to the source and target regions.

The contributions of this paper are as follows:

  1. A dual-order attentive generative adversarial network is proposed to detect and localize image copy-move forgery
  2. The 1st-order attention module extracts an attention map of the tampered regions, and the 2nd-order attention module extracts pixel-level dependencies, providing more discriminative features
  3. Extensive experiments show the superior performance of the method

method

[Figure: overall architecture of DOA-GAN]

The generator is a unified end-to-end structure that performs both the detection and localization tasks.

Input image I, extract hierarchical features with the first four layers of VGG19, resize them to the same size, and concatenate them into \(F_{cat}\).

Compute the affine matrix, then obtain the attention maps \(A_1\) and \(A_2\) from the dual-order attention module.

ASPP-1 and ASPP-2, with different parameters, extract the contextual features \(F_{aspp}^1\) and \(F_{aspp}^2\).

\(A_1\) is multiplied pixel-wise with \(F_{aspp}^1\) and \(F_{aspp}^2\) to obtain \(F_{atten}^1\) and \(F_{atten}^2\).

\(A_2\) is matrix-multiplied with \(F_{atten}^1\) and \(F_{atten}^2\) to obtain \(F_{cooc}^1\) and \(F_{cooc}^2\).

The four features are fused and fed into the detection branch to obtain the detection score (the probability that the image is tampered?) and into the localization branch to obtain the mask. (The inputs to the discriminator are I and M.)

3.1 Generator Network

Input the image \(I \in \mathbb{R}^{H \times W \times 3}\), extract feature maps with the first three layers of VGG19, resize them to the same size, and concatenate them into \(F_{cat} \in \mathbb{R}^{h \times w \times d}\), where \(h = \frac{H}{8}\) and \(w = \frac{W}{8}\). Then compute the affine matrix (Equation 1).

[Equation 1: the affine matrix \(S \in \mathbb{R}^{hw \times hw}\) computed from \(F'_{cat}\)]

where \(F'_{cat} \in \mathbb{R}^{hw \times d}\)
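Below is a minimal PyTorch sketch of this step, assuming the affine matrix is the pairwise similarity of the flattened, L2-normalized features; the exact VGG19 cut points and the normalization used in the paper may differ.

```python
import torch
import torch.nn.functional as F
from torchvision.models import vgg19

# Sketch of feature extraction and affine-matrix computation (assumptions:
# cut points for the first three VGG19 blocks, bilinear resize to H/8 x W/8,
# and L2-normalized dot products as the similarity measure).
vgg_features = vgg19(weights="IMAGENET1K_V1").features.eval()

def affine_matrix(image):                          # image: (1, 3, H, W)
    feats, x = [], image
    for i, layer in enumerate(vgg_features):
        x = layer(x)
        if i in (3, 8, 17):                        # ends of the first three conv blocks
            feats.append(x)
        if i == 17:
            break
    h, w = image.shape[2] // 8, image.shape[3] // 8
    feats = [F.interpolate(f, size=(h, w), mode="bilinear", align_corners=False)
             for f in feats]
    F_cat = torch.cat(feats, dim=1)                # (1, d, h, w)
    F_flat = F_cat.flatten(2).transpose(1, 2)      # F'_cat: (1, hw, d)
    F_flat = F.normalize(F_flat, dim=-1)
    S = torch.bmm(F_flat, F_flat.transpose(1, 2))  # affine matrix S: (1, hw, hw)
    return S
```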

The Dual-Order Attention Module

[Figure: the dual-order attention module]

Extract the copy-move region aware attention map \(A_1\) and the co-occurrence attention map \(A_2\).

When an image is correlated with itself, the diagonal values of the affine matrix are relatively large.

[Equation 2: the Gaussian kernel \(G\)]

In Equation 2, a Gaussian kernel \(G\) is used to weaken the correlation of a patch with the same region of the image, giving a new affine matrix \(S' = S \odot G\).

Using the patch-matching strategy of [6], the probability that the patch in row \(i\) of \(S'\) matches the patch in column \(j\) is computed.

[Equation: patch-matching probability, yielding the final affine matrix \(L\)]

\(\alpha\) is a trainable parameter initialized to 3, and \(L \in \mathbb{R}^{hw \times hw}\) is the final affine matrix. The top-k values are taken from \(L\) and reshaped into \(T \in \mathbb{R}^{h \times w \times k}\), and \(T\) is fed into the attention model, as shown in the schematic diagram.

[Figure: schematic of the attention computation from \(T\)]
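One plausible reading of this pipeline is sketched below, taking \(S\) as the \(hw \times hw\) affine matrix from Equation 1. The self-neighborhood kernel, the row-wise softmax with temperature \(\alpha\) standing in for the matching step of [6], and the column-wise top-k are assumptions about the exact formulas.

```python
import torch
import torch.nn.functional as F

# Sketch of the dual-order attention inputs: suppress self-matches with a
# Gaussian-style kernel G, turn similarities into matching probabilities, and
# keep the top-k values per column as the tensor T fed to the CNN producing A_1.
def dual_order_inputs(S, h, w, k=20, sigma=2.0, alpha=3.0):
    # 2D grid coordinates of each patch, used to build the self-neighborhood kernel
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    coords = torch.stack([ys.flatten(), xs.flatten()], dim=1).float()   # (hw, 2)
    dist2 = torch.cdist(coords, coords) ** 2                            # (hw, hw)
    G = 1.0 - torch.exp(-dist2 / (2 * sigma ** 2))    # ~0 near a patch itself, ~1 far away
    S_prime = S * G                                    # S' = S ⊙ G (Equation 2 analogue)
    L = F.softmax(alpha * S_prime, dim=1)              # row-wise matching probabilities
    topk_vals, _ = torch.topk(L, k, dim=0)             # top-k values of each column: (k, hw)
    T = topk_vals.t().reshape(h, w, k)                 # reshape to (h, w, k)
    return L, T                                        # A_2 is derived from L; A_1 = CNN(T)
```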

Atrous Spatial Pyramid Pooling (ASPP) Block

ASPP is used to extract contextual information; the paper finds that two ASPP blocks can effectively learn the two tasks of source and target detection.
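For reference, a minimal ASPP block in the spirit of DeepLabV3+ might look like the sketch below; the dilation rates and channel widths are assumptions rather than the paper's settings.

```python
import torch
import torch.nn as nn

# Minimal ASPP sketch: parallel atrous convolutions with different dilation
# rates plus a 1x1 branch, concatenated and projected back to out_ch channels.
class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1 if r == 1 else 3,
                          padding=0 if r == 1 else r, dilation=r, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True))
            for r in rates
        ])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))
```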

Feature Fusion

The copy-move region aware attentive features and the co-occurrence features are combined; these four feature maps exploit the dependencies between patches, and, through the similarity measure, distant pixels can contribute to the feature response at a given location.

[Equations: \(F_{atten}^i = A_1 \odot F_{aspp}^i\), \(F_{cooc}^i\) obtained by multiplying \(A_2\) with \(F_{atten}^i\), and their fusion]

The merge operation refers to concatenation
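A small sketch of the fusion step under assumed shapes: \(A_1\) as a single-channel map, \(A_2\) as the \(hw \times hw\) co-occurrence attention, and the ASPP features as \(c \times h \times w\) maps. The exact composition in the paper may differ in detail.

```python
import torch

# Sketch of feature fusion: pixel-wise multiplication with A_1 gives the
# location-aware features, matrix multiplication with A_2 propagates features
# between matched patches, and the four results are concatenated channel-wise.
def fuse(F_aspp1, F_aspp2, A1, A2):
    # F_aspp*: (B, c, h, w); A1: (B, 1, h, w); A2: (B, hw, hw)
    def cooc(F_atten):
        b, c, h, w = F_atten.shape
        flat = F_atten.view(b, c, h * w)              # (B, c, hw)
        out = torch.bmm(flat, A2.transpose(1, 2))     # weighted sum over matched patches
        return out.view(b, c, h, w)

    F_atten1 = A1 * F_aspp1                           # location-aware features
    F_atten2 = A1 * F_aspp2
    F_cooc1 = cooc(F_atten1)                          # co-occurrence features
    F_cooc2 = cooc(F_atten2)
    return torch.cat([F_atten1, F_cooc1, F_atten2, F_cooc2], dim=1)
```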

Detection Branch and Localization Branch

3.2 Discriminator Network

The structure of the discriminator is based on the PatchGAN discriminator [18]. The discriminator predicts whether each N×N patch in the image is real or fake.

The inputs of the discriminator are I and the mask M; the discriminator is trained to distinguish the ground-truth mask from the predicted mask, while the generator tries to fool the discriminator.
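A minimal PatchGAN-style discriminator sketch is given below; the layer widths, the depth, and the assumption of a 3-channel mask input are illustrative rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

# PatchGAN-style discriminator: image and mask are concatenated channel-wise
# and mapped to a grid of real/fake logits, one per image patch.
class PatchDiscriminator(nn.Module):
    def __init__(self, in_ch=3 + 3, base=64):        # 3 image channels + 3 mask channels
        super().__init__()
        layers, ch = [], in_ch
        for mult in (1, 2, 4, 8):
            layers += [nn.Conv2d(ch, base * mult, 4, stride=2, padding=1),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = base * mult
        layers += [nn.Conv2d(ch, 1, 4, padding=1)]    # one logit per N x N patch
        self.net = nn.Sequential(*layers)

    def forward(self, image, mask):
        return self.net(torch.cat([image, mask], dim=1))
```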

3.3 Loss Functions

The loss function is composed of the adversarial loss, the cross-entropy loss, and the detection loss.

[Equation: total loss combining the adversarial, cross-entropy, and detection terms]

Adversarial Loss

[Equation: adversarial loss]

The discriminator D tries to maximize this objective, while the generator G tries to minimize it.


Cross-Entropy Loss

[Equation: cross-entropy loss]

\(\widehat{M} = G(I)\) is the mask predicted by the generator, and \(M\) is the ground-truth mask.

Detection Loss

The detection loss is the binary cross-entropy between the detection-branch score and the true label.

[Equation: detection loss]

If the image contains tampering, \(y_m = 1\); otherwise \(y_m = 0\). \(\hat m_{im}\) is the output of the detection branch.
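Putting the three terms together under standard formulations, a generator-side loss sketch could look as follows; the relative weights and the exact adversarial formulation are assumptions, and D is any discriminator taking (image, mask), such as the PatchGAN sketch above.

```python
import torch
import torch.nn.functional as F

# Sketch of the generator loss: per-pixel cross-entropy on the mask, an
# adversarial term that pushes D to label the predicted mask as real, and
# binary cross-entropy on the image-level detection score.
def generator_loss(D, image, mask_logits, gt_classes, det_score, y_m,
                   w_adv=0.1, w_det=1.0):
    pred_mask = torch.softmax(mask_logits, dim=1)              # (B, 3, H, W)
    fake_logits = D(image, pred_mask)
    L_adv = F.binary_cross_entropy_with_logits(
        fake_logits, torch.ones_like(fake_logits))             # fool the discriminator
    L_ce = F.cross_entropy(mask_logits, gt_classes)            # pristine / source / target
    L_det = F.binary_cross_entropy(det_score, y_m)             # image-level tampering score
    return L_ce + w_adv * L_adv + w_det * L_det
```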

3.4 Implementation Details

Feature extraction uses the first three layers of VGG19 pre-trained on ImageNet.

The ASPP block follows DeepLabV3+ [5].

In the first-order attention, k = 20.

The generator learning rate is 0.001, the discriminator learning rate is 0.0001, and the VGG19 learning rate is 0.0001; the learning rates are halved every 5 epochs. During training, only the generator's cross-entropy loss is optimized for the first 3 epochs, after which all losses are optimized. When the discriminator loss reaches 0.3, the discriminator is frozen until its loss rises again. This keeps the generator and discriminator learning at a similar pace while preventing the discriminator from being over-trained.
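A self-contained toy sketch of this schedule is shown below, using placeholder models and data; only the 3-epoch warm-up, the freeze-at-0.3 rule, and the halving of the learning rates every 5 epochs are taken from the notes above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy training schedule: generator warm-up on cross-entropy only, then joint
# training with the discriminator frozen whenever its loss drops to ~0.3.
G, D = nn.Linear(8, 3), nn.Linear(3, 1)               # placeholder generator/discriminator
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)
sched_g = torch.optim.lr_scheduler.StepLR(opt_g, step_size=5, gamma=0.5)
sched_d = torch.optim.lr_scheduler.StepLR(opt_d, step_size=5, gamma=0.5)
bce, ce = nn.BCEWithLogitsLoss(), nn.CrossEntropyLoss()
train_d = True

for epoch in range(10):
    x, y = torch.randn(4, 8), torch.randint(0, 3, (4,))   # placeholder batch
    logits = G(x)
    g_loss = ce(logits, y)                                 # cross-entropy only during warm-up
    if epoch >= 3:
        g_loss = g_loss + 0.1 * bce(D(logits), torch.ones(4, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    if epoch >= 3:                                         # discriminator step after warm-up
        real = F.one_hot(y, 3).float()                     # stand-in for the ground-truth mask
        d_loss = 0.5 * (bce(D(real), torch.ones(4, 1))
                        + bce(D(logits.detach()), torch.zeros(4, 1)))
        if train_d:
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()
        train_d = d_loss.item() > 0.3                      # freeze at 0.3, resume when loss rises
    sched_g.step(); sched_d.step()                         # halve learning rates every 5 epochs
```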

4 Experimental Results

Datasets:

USC-ISI CMFD dataset [46]: 80K, 10K, and 10K images for training, validation, and testing

the CASIA CMFD dataset [46]: 1313 tampered images and the corresponding 1313 original images

the CoMoFoD dataset [41]: 5000 tampered images from 200 base images, with 25 manipulation categories covering 5 kinds of manipulations and 5 post-processing methods

experiment

Detection is evaluated at the image level and localization at the pixel level; the metrics are precision, recall, and F1 score over 3 classes: Pristine (background), Source, and Target.

4.1 Experiments on the USC-ISI CMFD dataset

Comparison methods:

BusterNet [46]

ManTra-Net [47]

U-Net [38]

NA-GAN (no attention)

FOA-GAN (only 1st-order attention)

SOA-GAN (only 2nd-order attention)

DOA-GAN w/o \(L_{adv}\) (adversarial loss removed)

DOA-GAN w/o \(L_{det}\) (detection loss removed)

 

DOA-GAN is trained using 10K original COCO images and the 80K USC-ISI images.

For pixel-level evaluation, the average per-image precision, recall, and F1 score are computed on tampered images only (original images are excluded).

For image-level evaluation, 20K images are used (tampered and non-tampered).

If the detection-branch score is greater than 0.5, the image is considered tampered.

For BusterNet and DOA-GAN w/o \(L_{det}\), an image is considered tampered if the output mask has more than 200 pixels labeled as source or target.
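A small sketch of both decision rules, with the mask class encoding (0 = pristine, 1 = source, 2 = target) assumed for illustration:

```python
import numpy as np

# Image-level decision: threshold the detection-branch score at 0.5, or, for
# models without a detection branch, call an image tampered when more than 200
# pixels of the predicted mask are source or target.
def is_tampered(det_score=None, pred_mask=None, score_thresh=0.5, pixel_thresh=200):
    if det_score is not None:
        return det_score > score_thresh
    return int(np.count_nonzero(pred_mask > 0)) > pixel_thresh
```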

Table 1 shows the localization results.


This paper produces three-channel (pristine / source / target) results.

Table 2 shows the detection results.


Findings:

  1. DOA-GAN w/o \(L_{adv}\) outperforms BusterNet on all metrics
  2. DOA-GAN outperforms DOA-GAN w/o \(L_{adv}\), indicating that the discriminator has good discriminative ability
  3. DOA-GAN outperforms DOA-GAN w/o \(L_{det}\), indicating that the detection loss helps
  4. Except for the F1 score on Pristine, DOA-GAN outperforms FOA-GAN and SOA-GAN, indicating that the two attentions complement each other to improve detection and localization performance
  5. DOA-GAN, SOA-GAN, and FOA-GAN outperform U-Net and NA-GAN, indicating the effectiveness of the affine-matrix computation

Figure 5 shows qualitative results; DOA-GAN performs best. The second-to-last column is the DOA-GAN result and the last column is the ground truth.


4.2 Experiments on the CASIA CMFD dataset

CASIA does not provide source and target labels, so the last convolutional layer of DOA-GAN is replaced with a 1-channel output.

DOA-GAN and BusterNet are trained using the COCO and USC-ISI datasets.

Comparison methods (the first three are traditional methods):

a block-based CMFD with Zernike moment features (denoted as “Block-ZM”) [39]

an adaptive segmentation based CMFD (denoted as “Adaptive-Seg”) [36]

a discrete cosine transform (DCT) coefficients based CMFD (denoted as “DCT-Match”) [12]

DenseField [8]

BusterNet [46]

For pixel-level performance, precision, recall, and F1 are computed for each positive (tampered) image.

For image-level detection, an image is considered tampered if the output mask contains more than 200 tampered pixels; the tampered images and the corresponding original images are both used.

Table 3 shows the results.


The BusterNet results in the table differ from those reported in [46], because here BusterNet is trained only on the datasets above.

Figure 6 shows qualitative results. (What is DenseField?)


4.3 Experiments on the CoMoFoD dataset

Table 4 shows the results.


This dataset includes many manipulations and post-processing operations, such as rotation, scaling, distortion, compression, blurring, and noise.

Figure 8 shows the F1 scores.


Figure 9 shows the number of correctly detected images.


An image with a pixel-level F1 score above 30% is considered correctly detected.

4.4 Discussion

Limitations: copying part of the background onto a very similar background is hard to detect, and tampering with too large a scaling factor is also hard to detect.


In the first failure case the background is too similar; in the second the tampered region is too small.

5 Extension to Other Manipulation Types

DOA-GAN computes the affine matrix within a single image, which can easily be extended to computing the matrix across two images (splicing detection, video copy-move).

Comparison methods:

DMVN [45]

DMAC [23]

Splicing detection: trained on the synthetic dataset of [23] and tested on MS-COCO; Table 5 shows the results.


Video copy-move: treated as inter-frame splicing.

Video object segmentation datasets:

DAVIS [34]

SegTrackV2 [42]

Youtube-object [35]

Table 6 shows the results; DOA-GAN performs best.


Conclusion:

Future research: co-saliency localization detection and image-level tampering detection for satellite images.


Origin www.cnblogs.com/qina/p/12727060.html