Infrared small target: DNANet network structure and model building



1. Characteristics of infrared small targets and the contribution of this paper

Features of Infrared Small Target Detection

  1. Small targets
    Due to the long imaging distance, infrared targets are generally very small, ranging from one pixel to dozens of pixels in the image.

  2. Dim
Infrared targets usually have a low signal-to-clutter ratio and are easily buried in strong noise and clutter backgrounds.

  3. Shapeless
    Small infrared targets have limited shape features.

  4. Variable
    The size and shape of infrared targets can vary greatly from scene to scene.

  5. Networks designed for generic objects cannot be used directly
    Since infrared small targets are far smaller than generic objects, directly applying those methods to single-frame infrared small target (SIRST) detection easily causes small targets to be lost in the deep layers.

Several contributions of this paper

  1. DNANet is proposed to preserve small targets in the deep layers. Through repeated feature fusion and enhancement, the contextual information of small targets can be integrated and fully exploited.
  2. A densely nested interaction module and a channel-spatial attention module are proposed to achieve level-wise feature fusion and adaptive feature enhancement.
  3. An infrared small target dataset (i.e., NUDT-SIRST) was developed.
  4. Experiments on both public datasets and the NUDT-SIRST dataset demonstrate the superior performance of the proposed method, which is more robust than existing methods to variations in clutter background, target size, and target shape.

2. Network structure analysis

Overall network structure of DNANet

The overall structure of DNANet is shown in the figure below. (a) Feature extraction module: the input image is first fed into the Densely Nested Interaction Module (DNIM) for level-by-level feature fusion, and the features at different semantic levels are then adaptively enhanced by the Channel and Spatial Attention Module (CSAM). (b) Feature Pyramid Fusion Module (FPFM): the enhanced features are upsampled and concatenated to achieve multi-layer output fusion. (c) Eight-connected neighborhood clustering module: the segmentation map is clustered and the centroid of each target region is determined.
[Figure: overall architecture of DNANet]

Feature Extraction Module

Inspired by U-Net, the authors take U-Net as the basic network structure and deepen it to obtain richer semantic information and a larger receptive field. Considering how small infrared targets are, they design a dedicated module that extracts deep features while preserving the representation of small targets in the deep layers.

DNIM – The Dense Nested Interactive Module

Based on the above ideas, the authors designed the DNIM module. Multiple U-shaped structures are stacked together, with nodes placed throughout the network to connect them. Each node receives features from its own layer and the adjacent layers, achieving repeated multi-layer feature fusion and thereby preserving the representation of small targets in the deep layers.

Let $I$ denote the number of DNIM layers and take $i \in \{0, 1, 2, \dots, I\}$, where $i$ indexes the $i$-th downsampling layer along the encoder and $j$ indexes the $j$-th node along the dense skip connections. $L^{i,j}$ denotes the output of node $(i, j)$. When $j = 0$, each node receives features only from the preceding encoder node:
$$L^{i,0} = F\left(P_{max}\left(L^{i-1,0}\right)\right)$$
where $F(\cdot)$ represents multiple cascaded convolutional layers and $P_{max}(\cdot)$ represents the max-pooling layer. When $j > 0$, each node receives outputs from three directions:
$$L^{i,j} = F\left(\left[\left[L^{i,k}\right]_{k=0}^{j-1},\; P_{max}\left(L^{i-1,j}\right),\; U\left(L^{i+1,j-1}\right)\right]\right)$$
where $U(\cdot)$ represents the upsampling layer and $[\cdot]$ denotes channel concatenation: each node fuses the dense skip features from all preceding nodes at its own level, the max-pooled feature from the shallower layer, and the upsampled feature from the deeper layer.
[Figure: structure of the dense nested interactive module]
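A minimal sketch of the node update for $j > 0$ (an illustration under assumptions, not the authors' exact implementation: the DNIMNode name, channel counts, and the conv block are hypothetical):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DNIMNode(nn.Module):
    """One dense nested node L^{i,j} (j > 0): fuses same-level dense skips,
    the max-pooled feature from level i-1, and the upsampled feature from
    level i+1. A sketch; channel counts and the conv block are illustrative."""

    def __init__(self, in_channels, out_channels):
        super().__init__()
        # F(.) in the paper: multiple cascaded convolutional layers
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
        )
        self.pool = nn.MaxPool2d(2)  # P_max(.)

    def forward(self, same_level_feats, shallower_feat, deeper_feat):
        # same_level_feats: list [L^{i,0}, ..., L^{i,j-1}]
        # shallower_feat:   L^{i-1,j}   (higher resolution, gets max-pooled)
        # deeper_feat:      L^{i+1,j-1} (lower resolution, gets upsampled)
        up = F.interpolate(deeper_feat, scale_factor=2, mode='bilinear',
                           align_corners=True)      # U(.)
        down = self.pool(shallower_feat)             # P_max(.)
        fused = torch.cat(same_level_feats + [down, up], dim=1)
        return self.conv(fused)
```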

CSAM – Channel and Spatial Attention Module

In the multi-layer feature fusion stage of DNIM, CSAM is used for adaptive feature enhancement to reduce the semantic gap between levels, as shown below.
[Figure: structure of CSAM]
As the figure shows, CSAM consists of two cascaded attention units: channel attention and spatial attention. Each node $L^{i,j}$ is processed by a one-dimensional channel attention map $M_c \in \mathbb{R}^{C_i \times 1 \times 1}$ followed by a two-dimensional spatial attention map $M_s \in \mathbb{R}^{1 \times H_i \times W_i}$.

Channel attention
[Figure: channel attention unit]

  • The feature map passes through MaxPool and AvgPool respectively, producing two weight vectors of shape $[C, 1, 1]$.
  • Both vectors pass through the same MLP (one network with shared parameters), which maps them to a weight for each channel.
  • The two mapped weight vectors are added together, followed by a Sigmoid.
  • The resulting channel weights $[C, 1, 1]$ are multiplied channel-wise with the original feature map $[C, H, W]$, as sketched in the code below.
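These steps amount to $M_c(L) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(L)) + \mathrm{MLP}(\mathrm{MaxPool}(L))\big)$. A minimal PyTorch sketch (an illustration, not the authors' exact code; the reduction ratio is an assumed hyperparameter):

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Channel attention: sigmoid(MLP(AvgPool(x)) + MLP(MaxPool(x))).
    A minimal sketch; the reduction ratio is an illustrative choice."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP applied to both pooled vectors
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):                       # x: [B, C, H, W]
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))      # global AvgPool -> [B, C]
        mx = self.mlp(x.amax(dim=(2, 3)))       # global MaxPool -> [B, C]
        w = torch.sigmoid(avg + mx).view(b, c, 1, 1)
        return x * w                            # channel-wise re-weighting
```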

Spatial attention
[Figure: spatial attention unit]

  • The feature map passes through channel-wise MaxPool and AvgPool respectively, producing two $[1, H, W]$ maps: all channels at each spatial position are pooled, so the shape changes from $[C, H, W]$ to $[1, H, W]$.
  • The two pooled maps are stacked to form a $[2, H, W]$ spatial weight descriptor.
  • A 7×7 convolutional layer reduces the $[2, H, W]$ map to $[1, H, W]$; this map encodes the importance of each position on the feature map, with larger values being more important.
  • The resulting spatial weights $[1, H, W]$ are multiplied with the original feature map $[C, H, W]$, giving every point on the $[H, W]$ map a weight (see the sketch below).
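These steps amount to $M_s(L) = \sigma\big(f^{7 \times 7}([\mathrm{AvgPool}(L);\ \mathrm{MaxPool}(L)])\big)$. A minimal PyTorch sketch under the same caveats as above:

```python
import torch
import torch.nn as nn


class SpatialAttention(nn.Module):
    """Spatial attention: sigmoid(conv7x7([channel-AvgPool; channel-MaxPool])).
    A minimal sketch for illustration."""

    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):                          # x: [B, C, H, W]
        avg = x.mean(dim=1, keepdim=True)          # channel-wise AvgPool -> [B, 1, H, W]
        mx = x.amax(dim=1, keepdim=True)           # channel-wise MaxPool -> [B, 1, H, W]
        w = torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))
        return x * w                               # position-wise re-weighting
```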

Intuitively, the feature map can be viewed as an $[H, W]$ grid where each point $(x, y)$, $x \in (0, H)$, $y \in (0, W)$, holds $C$ values whose magnitudes indicate how important that point is; traced back through the receptive field, this corresponds to the importance of a region in the original image. The attention modules let the network adaptively focus on the places that matter (positions with larger values are more likely to be attended to).

Feature Pyramid Fusion Module

The enhanced features are upsampled and concatenated to achieve multi-layer output fusion: shallow features, which carry rich spatial and detail information, are combined with deep features, which carry rich semantic information, to generate a globally robust feature map.
[Figure: feature pyramid fusion module]

That is, $L^{4,0}$, $L^{3,1}$, $L^{2,2}$, $L^{1,3}$, and $L^{0,4}$ are each upsampled to $[C^{i,j}, H^{0,4}, W^{0,4}]$, concatenated along the channel dimension, and finally fused into an output of shape $[C^{0,4}, H^{0,4}, W^{0,4}]$.
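A minimal sketch of this fusion step (the FPFM name is from the paper, but the 1×1 fusion convolution and channel counts here are illustrative assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FPFM(nn.Module):
    """Feature pyramid fusion: upsample each enhanced node output to the
    finest resolution, concatenate along channels, and fuse the result.
    A minimal sketch; the 1x1 conv and channel counts are illustrative."""

    def __init__(self, in_channels_list, out_channels):
        super().__init__()
        self.fuse = nn.Conv2d(sum(in_channels_list), out_channels, kernel_size=1)

    def forward(self, feats):
        # feats: [L^{4,0}, L^{3,1}, L^{2,2}, L^{1,3}, L^{0,4}], coarse to fine
        target_size = feats[-1].shape[2:]            # (H^{0,4}, W^{0,4})
        ups = [F.interpolate(f, size=target_size, mode='bilinear',
                             align_corners=True) for f in feats]
        return self.fuse(torch.cat(ups, dim=1))      # [B, out_channels, H, W]
```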

Eight-connected neighbor clustering module

After the feature pyramid fusion module, an eight-connected neighborhood clustering module is introduced to cluster all pixels and compute the centroid of each target. If any two pixels $g(m_0, n_0)$ and $g(m_1, n_1)$ of the segmentation map $g$ have intersecting eight-neighborhoods (formula 8) and the same value, 0 or 1 (formula 9), the two pixels are considered to lie in the same connected region. Pixels in a connected region belong to the same target. Once all targets in the image are identified, their centroids are computed as their coordinates.
$$N_8\big(g(m_0, n_0)\big) \cap N_8\big(g(m_1, n_1)\big) \neq \varnothing \quad (8)$$
$$g(m_0, n_0) = g(m_1, n_1) \in \{0, 1\} \quad (9)$$
where $N_8(\cdot)$ denotes the eight-neighborhood of a pixel.
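In practice this is standard 8-connected component labeling. A minimal sketch using SciPy (a standard substitute for the paper's module, not its actual code):

```python
import numpy as np
from scipy import ndimage


def eight_connected_centroids(seg_map):
    """Label 8-connected regions of a binary segmentation map and return
    the centroid of each target region. Uses SciPy's standard
    connected-component labeling as a stand-in for the paper's module."""
    # A 3x3 all-ones structuring element includes diagonal neighbors,
    # i.e. 8-connectivity
    structure = np.ones((3, 3), dtype=int)
    labels, num_targets = ndimage.label(seg_map > 0.5, structure=structure)
    # Centroid (row, col) of each labeled target region
    centroids = ndimage.center_of_mass(seg_map, labels,
                                       range(1, num_targets + 1))
    return labels, centroids
```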

3. Loss calculation

The network is trained with the Soft-IoU loss, consistent with AGPCNet. The Soft-IoU loss was explained in the AGPCNet post:
https://blog.csdn.net/weixin_33538887/article/details/126401466
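For reference, a commonly used formulation of the Soft-IoU loss (a sketch of the usual definition; the smoothing constant is an illustrative choice):

```python
import torch


def soft_iou_loss(pred, target, smooth=1.0):
    """Soft-IoU loss: 1 minus the soft intersection-over-union between the
    predicted probability map and the binary ground truth. A common
    formulation; the smoothing constant is illustrative."""
    pred = torch.sigmoid(pred)                      # logits -> probabilities
    inter = (pred * target).sum(dim=(1, 2, 3))      # soft intersection
    union = pred.sum(dim=(1, 2, 3)) + target.sum(dim=(1, 2, 3)) - inter
    return (1 - (inter + smooth) / (union + smooth)).mean()
```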

4. Evaluation indicators

For evaluation, two commonly used indicators are adopted: the detection rate $P_d$ and the false alarm rate $F_a$.

The detection rate $P_d$ is a target-level evaluation metric measuring the ratio of correctly predicted targets to all targets. It is defined as:
$$P_d = \frac{T_{correct}}{T_{All}}$$
where $T_{correct}$ and $T_{All}$ denote the number of correctly predicted targets and the total number of real targets, respectively. A target is considered correctly predicted if its centroid deviation is smaller than the maximum allowed deviation; in this paper, the maximum centroid deviation is set to 3 pixels.

The false alarm rate $F_a$ is another target-level evaluation metric. It measures the proportion of falsely predicted pixels among all image pixels and is defined as:
$$F_a = \frac{P_{false}}{P_{All}}$$

where $P_{false}$ and $P_{All}$ denote the number of falsely predicted pixels and the total number of image pixels, respectively.
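A minimal sketch of how $P_d$ and $F_a$ could be computed for one image (a hypothetical helper, matching predictions to ground truth by the 3-pixel centroid deviation mentioned above):

```python
import numpy as np


def pd_fa(pred_centroids, gt_centroids, pred_mask, gt_mask, max_dev=3.0):
    """Detection rate Pd and false alarm rate Fa for a single image.
    A sketch: a ground-truth target counts as detected when some predicted
    centroid lies within max_dev pixels of it; falsely predicted pixels are
    predicted-positive pixels that fall outside the (boolean) ground-truth mask."""
    correct = 0
    for gc in gt_centroids:
        dists = [np.hypot(gc[0] - pc[0], gc[1] - pc[1]) for pc in pred_centroids]
        if dists and min(dists) < max_dev:
            correct += 1
    p_d = correct / max(len(gt_centroids), 1)        # T_correct / T_All
    false_pixels = np.logical_and(pred_mask, ~gt_mask).sum()
    f_a = false_pixels / gt_mask.size                # P_false / P_All
    return p_d, f_a
```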

The ROC curve describes how the detection probability $P_d$ changes under different false alarm rates $F_a$.
Plotting the false alarm rate on the horizontal axis and the detection rate on the vertical axis yields the ROC curve. The more convex the curve, that is, the larger the area enclosed between the curve and the horizontal axis, the better the detection performance of the method.
[Figure: ROC curves of different detection methods]

5. Paper information

Paper: https://arxiv.org/pdf/2106.00487v3.pdf
Source code (PyTorch implementation, with dataset): https://github.com/YeRen123455/Infrared-Small-Target-Detection

Origin: blog.csdn.net/weixin_33538887/article/details/126519082