[Interpretation of the paper] Camouflaged Object Detection


SINet comes in two versions:

SINet-v1, published at CVPR 2020

Paper address: Camouflaged_Object_Detection_CVPR_2020

Code address: SINet-v1 code

SINet-v2, published in IEEE TPAMI 2021

Paper address: Concealed Object Detection

Code address: SINet-V2

The network structure changed somewhat between the v1 and v2 versions.

v1 network structure:

[Figure: SINet-v1 network structure]

v2 network structure:

[Figure: SINet-v2 network structure]

SINet v1

The main contributions of the SINet paper are the COD10K dataset and opening up camouflaged object detection as a research direction.

SINet v1 does not introduce much innovation in the network structure; it is mainly based on the CPD framework.

It is recommended to read this article before reading the v1 structure:

  • Cascaded Partial Decoder for Fast and Accurate Salient Object Detection

That paper, from CVPR 2019, addresses salient object detection.

It mainly proposes a new cascaded partial decoder (CPD) framework for fast and accurate salient object detection.

The RF module, SA module, and PDC module used in SINet are all taken directly from the CPD framework.

The dual-branch structure used by SINet-v1 is likewise borrowed from CPD.

The basic structure is the same, except that SINet-v1 does not discard the low-level features.

You can read another blog post about the CPD framework; the PDC module, RF module, and SA module are all explained there:

https://zpf1900.blog.csdn.net/article/details/127429430

The structure of the entire network is also modeled on CPD, with two branches.

Although the authors divide the network into two parts named the Search Module (SM) and the Identification Module (IM), it is essentially the dual-branch structure of CPD.

So, naming is an art

The backbone network is ResNet-50, and the features from all five convolutional blocks are used (none are discarded).

The first branch passes the features of the five convolutional blocks through RF modules and fuses them with the PDC.

The second branch applies SA to the feature map of the third block, then passes it, together with the feature maps of the fourth and fifth convolutional blocks, through RF modules and into the PDC to obtain a refined map.

The two branches are jointly trained using the cross-entropy loss function
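
To make this dual-branch flow concrete, here is a minimal PyTorch-style sketch of how the Search and Identification branches could be wired together. The `RFStub`, `PDCStub`, and `SINetV1Sketch` names, the 32-channel width, and the simple sigmoid-mask stand-in for SA are assumptions for illustration, not the authors' implementation (rough sketches of the real RF and PDC blocks follow further below).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RFStub(nn.Module):
    """Placeholder for the RF block: a single conv standing in for the real module."""
    def __init__(self, cin, cout=32):
        super().__init__()
        self.conv = nn.Conv2d(cin, cout, 3, padding=1)
    def forward(self, x):
        return self.conv(x)

class PDCStub(nn.Module):
    """Placeholder for the partial decoder: upsample to a common size, concat, fuse."""
    def __init__(self, n_inputs, c=32):
        super().__init__()
        self.fuse = nn.Conv2d(n_inputs * c, 1, 3, padding=1)
    def forward(self, feats):
        h, w = feats[0].shape[-2:]
        feats = [F.interpolate(f, size=(h, w), mode='bilinear', align_corners=False)
                 for f in feats]
        return self.fuse(torch.cat(feats, dim=1))

class SINetV1Sketch(nn.Module):
    """Dual-branch wiring of SINet-v1 as described above (a sketch, not the official code)."""
    def __init__(self, chans=(64, 256, 512, 1024, 2048)):
        super().__init__()
        self.rf_sm = nn.ModuleList([RFStub(c) for c in chans])       # Search Module
        self.rf_im = nn.ModuleList([RFStub(c) for c in chans[2:]])   # Identification Module
        self.pdc_sm = PDCStub(5)
        self.pdc_im = PDCStub(3)

    def forward(self, x0, x1, x2, x3, x4):
        # Search Module: RF on all five backbone blocks, fused by the PDC -> coarse map
        coarse = self.pdc_sm([rf(x) for rf, x in zip(self.rf_sm, (x0, x1, x2, x3, x4))])
        # Search Attention stand-in: weight the 3rd-block features with the coarse map
        att = torch.sigmoid(F.interpolate(coarse, size=x2.shape[-2:],
                                          mode='bilinear', align_corners=False))
        x2_sa = x2 * att
        # Identification Module: RF + PDC on the attended 3rd block plus blocks 4 and 5
        fine = self.pdc_im([rf(x) for rf, x in zip(self.rf_im, (x2_sa, x3, x4))])
        return coarse, fine
```

Both output maps would then be upsampled to the ground-truth resolution and supervised jointly, e.g. with `F.binary_cross_entropy_with_logits` on each map, matching the joint cross-entropy training mentioned above.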

I have written about the specific network details in the CPD blog, so I won’t explain them here. CPD explanation

In addition, the CPD paper does not provide detailed diagrams of the modules it uses.

The SINet paper provides diagrams for two of them.

RF module:

[Figure: RF module]
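
For intuition, here is a rough sketch of a receptive-field style block in this spirit: several parallel branches with increasingly dilated 3×3 convolutions, concatenated, fused, and added to a 1×1 shortcut. The kernel sizes and dilation rates (3/5/7) are assumptions for illustration; refer to the CPD/SINet code for the exact definition.

```python
import torch
import torch.nn as nn

class RFSketch(nn.Module):
    """Receptive-field style block sketch: parallel dilated branches + residual shortcut."""
    def __init__(self, cin, cout=32):
        super().__init__()
        self.branch0 = nn.Conv2d(cin, cout, 1)
        self.branch1 = nn.Sequential(nn.Conv2d(cin, cout, 1),
                                     nn.Conv2d(cout, cout, 3, padding=3, dilation=3))
        self.branch2 = nn.Sequential(nn.Conv2d(cin, cout, 1),
                                     nn.Conv2d(cout, cout, 3, padding=5, dilation=5))
        self.branch3 = nn.Sequential(nn.Conv2d(cin, cout, 1),
                                     nn.Conv2d(cout, cout, 3, padding=7, dilation=7))
        self.fuse = nn.Conv2d(4 * cout, cout, 3, padding=1)     # merge the four branches
        self.shortcut = nn.Conv2d(cin, cout, 1)                 # residual path
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        y = torch.cat([self.branch0(x), self.branch1(x),
                       self.branch2(x), self.branch3(x)], dim=1)
        return self.relu(self.fuse(y) + self.shortcut(x))
```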

PDC module:

[Figure: PDC module]
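
And a rough sketch of the partial-decoder style aggregation, shown here for three feature levels: each deeper feature is upsampled and multiplied into the shallower level before everything is concatenated and fused into a one-channel map. The number of inputs and convolutions is an assumption; the v1 search branch applies the same idea to more levels.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PDCSketch(nn.Module):
    """Partial-decoder style aggregation sketch for three levels (assumption):
    f3 is the deepest / smallest feature, f1 the shallowest of the three."""
    def __init__(self, c=32):
        super().__init__()
        self.conv2 = nn.Conv2d(c, c, 3, padding=1)
        self.conv1 = nn.Conv2d(c, c, 3, padding=1)
        self.fuse = nn.Sequential(nn.Conv2d(3 * c, 3 * c, 3, padding=1),
                                  nn.Conv2d(3 * c, 1, 1))   # one-channel camouflage map

    def up2(self, x):
        return F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)

    def forward(self, f1, f2, f3):
        # deeper features gate the shallower ones before concatenation
        f2 = self.conv2(self.up2(f3)) * f2
        f1 = self.conv1(self.up2(f2)) * f1
        out = torch.cat([f1, self.up2(f2), self.up2(self.up2(f3))], dim=1)
        return self.fuse(out)
```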

SINet v2

The biggest difference between v2 and v1 is the attention part: v2 uses group-reversal attention.

Feature extraction

ResNet-50 is still used, but unlike v1, only the features of the last three stages are needed here; the low-level features are discarded (this again follows the CPD framework).
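
As a rough illustration (using torchvision's ResNet-50; the 352×352 input is just an example size), taking only the last three stage outputs could look like this:

```python
import torch
import torchvision

# Assumption for illustration: a plain torchvision backbone, randomly initialized.
backbone = torchvision.models.resnet50(weights=None)

def extract_features(x):
    x = backbone.conv1(x)
    x = backbone.bn1(x)
    x = backbone.relu(x)
    x = backbone.maxpool(x)
    x = backbone.layer1(x)        # low-level features, discarded (as in CPD)
    p3 = backbone.layer2(x)       # 512 channels,  1/8 resolution
    p4 = backbone.layer3(p3)      # 1024 channels, 1/16 resolution
    p5 = backbone.layer4(p4)      # 2048 channels, 1/32 resolution
    return p3, p4, p5

feats = extract_features(torch.randn(1, 3, 352, 352))
```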

Texture Enhanced Module (TEM)

The features from the three stages each pass through a TEM. This is the RF module from v1 under a new name; the code is unchanged.

Neighbor Connection Decoder (NCD)

This is the PDC module from v1 with a new name, so it is not explained again here.

It yields the coarse map $C_6$.

Group-Reversal Attention (GRA)

The purpose of group-reversal attention is to erase the already-detected objects so that the network subsequently focuses on information in the remaining regions.

[Figure: group-reversal attention process]

The coarse map $C_6$ is already available; it is first reversed (negated), and the result is denoted $y$.

The feature $p_1^5$ extracted by the backbone network is denoted $x$.

The whole process divides $x$ into several groups along the channel dimension, inserts $y$ into each group, and then fuses everything with a convolution.

For example, take $p_1^5$ as $x$, a 32-channel input. Three GRA steps are performed. In the first step, $x$ is treated as a single group (all 32 channels); the reversed $C_6$, i.e. $y$, is appended to give 33 channels; a 3×3 convolution brings this back to 32 channels and a ReLU follows, giving a new $x$. Another convolution then compresses this new $x$ to a single channel, which becomes the new $y$, also called the attention score.

The new $x$ and $y$ then go through the second GRA step. This time the 32-channel input $x$ is split into 4 groups of 8 channels each, and $y$ is inserted after every group, so each group becomes 9 channels. Everything is fed into a convolution and mapped back to 32 channels, giving the new $x$; as before, compressing the channels yields the attention score, the new $y$.

In the third GRA step, $x$ is split into 32 groups, i.e. one channel per group, and a copy of $y$ is appended to each channel, giving 64 channels. As before, a convolution maps this back to 32 channels, and compressing the channels yields the attention score. The final $y$ is the $r_4^5$ in the figure; it is added to $C_6$, and after an upsampling to restore the spatial size we obtain $C_5$.
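
Putting the three steps together, here is a minimal sketch of one GRA stage based on the walkthrough above. The group counts (1, 4, 32), the 32-channel width, the `1 - sigmoid` reversal, and the residual-plus-upsample at the end follow this blog's description and are assumptions for illustration, not necessarily the released SINet-v2 code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def group_concat(x, y, groups):
    """Split x into `groups` channel groups and append the 1-channel map y after each group."""
    chunks = torch.chunk(x, groups, dim=1)
    return torch.cat([t for c in chunks for t in (c, y)], dim=1)

class GRAStageSketch(nn.Module):
    """One stage of group-reversal attention following the walkthrough above (a sketch)."""
    def __init__(self, c=32, groups=(1, 4, 32)):
        super().__init__()
        self.groups = groups
        # 3x3 convs that bring the grouped concatenation back to c channels
        self.refine = nn.ModuleList([nn.Conv2d(c + g, c, 3, padding=1) for g in groups])
        # 3x3 convs that compress to a 1-channel attention score
        self.score = nn.ModuleList([nn.Conv2d(c, 1, 3, padding=1) for _ in groups])

    def forward(self, x, coarse):
        # reverse the coarse prediction: large where the found object is NOT
        y = 1.0 - torch.sigmoid(coarse)
        for g, refine, score in zip(self.groups, self.refine, self.score):
            x = F.relu(refine(group_concat(x, y, g)))   # new x (back to c channels)
            y = score(x)                                # new y, the attention score
        # residual: add the final score (r in the figure) back onto the coarse map,
        # then upsample to the next stage's resolution
        return F.interpolate(coarse + y, scale_factor=2,
                             mode='bilinear', align_corners=False)
```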

$C_4$ and $C_3$ are obtained in the same way.

The whole process is essentially this: $C_6$ is the target we have already found; we erase $C_6$ from the feature map and let the network search for targets again, and after three rounds of searching we add $C_6$ back. This amounts to refining the details outside of $C_6$.

This process is then repeated on the feature maps from the three stages of the backbone network, which amounts to supplementing details at each stage.
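
Continuing the sketch above, the repetition over the three stages could look like this (hypothetical wiring: `p5`, `p4`, `p3` are the 32-channel TEM outputs from deepest to shallowest, and `c6` is the NCD output, assumed here to share `p5`'s spatial size):

```python
import torch.nn as nn

# Reuses GRAStageSketch from the sketch above.
stages = nn.ModuleList([GRAStageSketch() for _ in range(3)])

def cascade(p5, p4, p3, c6):
    c5 = stages[0](p5, c6)   # refine at the deepest stage, output upsampled 2x
    c4 = stages[1](p4, c5)   # middle stage
    c3 = stages[2](p3, c4)   # shallowest of the three; finest prediction
    return c5, c4, c3
```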

Finally, the output map is obtained; that is the entire network structure.

The authors also provide a diagram of the GRA module, as follows:

[Figure: GRA module]

Summary

The main contribution of this paper is to propose camouflaged object detection as a systematic research task.

The COD10K dataset was constructed.

SINet is proposed for detecting camouflaged objects.

SINet-v1 is not very innovative; it is essentially based on the network design of the following paper:

  • Cascaded Partial Decoder for Fast and Accurate Salient Object Detection

SINet-v2 changes the structure of v1 and replaces the attention module with a group-reversal attention module. The authors say they were inspired by the following papers:

  • PraNet: Parallel Reverse Attention Network for Polyp Segmentation, 2020
  • Object Region Mining with Adversarial Erasing: A Simple Classification to Semantic Segmentation Approach, 2017
  • Reverse Attention for Salient Object Detection, 2018


Original post: blog.csdn.net/holly_Z_P_F/article/details/127560119