Dense Nested Attention Network for Infrared Small Target Detection
1. Characteristics of infrared small targets and the contribution of this paper
Features of Infrared Small Target Detection
- Small: due to the long imaging distance, infrared targets are generally very small, ranging from one pixel to dozens of pixels in the image.
- Dim: infrared targets usually have a low signal-to-clutter ratio and are easily buried in strong noise and cluttered backgrounds.
- Shapeless: small infrared targets have limited shape features.
- Variable: the size and shape of infrared targets can vary greatly from scene to scene.
Networks designed for generic objects cannot be used directly
Since infrared small targets are much smaller than generic objects, directly applying these methods to SIRST detection easily leads to the loss of small targets in deep layers.
Several contributions of this paper
- A DNANet is proposed to preserve small targets in deep layers. Through repeated feature fusion and enhancement, the contextual information of small targets can be well integrated and fully exploited.
- A densely nested interaction module and a channel-spatial attention module are proposed to achieve level-wise feature fusion and adaptive feature enhancement.
- An infrared small target dataset (i.e., NUDT-SIRST) was developed.
- Experiments on both public datasets and the NUDT-SIRST dataset demonstrate the superior performance of the proposed method. Compared with existing methods, it is more robust to changes in clutter background, target size, and target shape.
2. Network structure analysis
Overall network structure of DNANet
The overall network structure of DNANet is shown in the figure below. (a) Feature extraction module: the input image is first fed into the Densely Nested Interaction Module (DNIM) to achieve level-by-level feature fusion; then the features at different semantic levels are adaptively enhanced by the Channel and Spatial Attention Module (CSAM). (b) Feature Pyramid Fusion Module (FPFM): the enhanced features are upsampled and concatenated to achieve multi-layer output fusion. (c) Eight-connected neighborhood clustering algorithm: the segmentation map is clustered and the centroid of each target region is determined.
Feature Extraction Module
Inspired by U-Net, the authors take U-Net as the basic network structure and progressively deepen it to obtain richer semantic information and a larger receptive field. Considering the small size of infrared targets, they design a dedicated module that extracts deep features while preserving the representation of small targets in deep layers.
DNIM – The Densely Nested Interaction Module
Based on the above ideas, the authors designed the DNIM module. Multiple U-shaped structures are stacked together, and multiple nodes are placed in the network to connect all layers. Each node receives features from its own and adjacent layers, achieving repeated multi-layer feature fusion. This preserves the representations of small targets in deep layers.
Let i (i = 0, 1, 2, ..., I) index the downsampling layers along the encoder, and let L^{i,j} denote the output of node (i, j). When j = 0, each node only receives features through the plain skip connection:

L^{i,0} = F(P_max(L^{i-1,0}))

where F(·) denotes a stack of cascaded convolutional layers and P_max(·) denotes the max-pooling layer. When j > 0, each node receives outputs from three directions:

L^{i,j} = F([ [L^{i,k}]_{k=0}^{j-1}, U(L^{i+1,j-1}), P_max(L^{i-1,j}) ])

where U(·) denotes the upsampling layer and [·] denotes channel-wise concatenation.
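As a sketch of the node update above (channel widths, kernel sizes, and the `DNIMNode` interface are illustrative assumptions, not the paper's exact configuration), one DNIM node could look like this in PyTorch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class DNIMNode(nn.Module):
    """One node L^{i,j} of the densely nested interaction module (illustrative sketch).

    For j > 0 the node fuses: all previous outputs of the same row
    [L^{i,0}, ..., L^{i,j-1}], the upsampled deeper node U(L^{i+1,j-1}),
    and the max-pooled shallower node P_max(L^{i-1,j}).
    """

    def __init__(self, in_ch, out_ch):
        super().__init__()
        # F(.) in the formula: cascaded conv + BN + ReLU (simplified here).
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, same_row_feats, deeper_feat, shallower_feat):
        # U(.): upsample the deeper feature to the current spatial size.
        up = F.interpolate(deeper_feat, size=same_row_feats[0].shape[2:],
                           mode="bilinear", align_corners=False)
        # P_max(.): downsample the shallower feature with max pooling.
        down = F.max_pool2d(shallower_feat, kernel_size=2)
        # [.]: channel-wise concatenation of all three directions.
        fused = torch.cat(same_row_feats + [up, down], dim=1)
        return self.conv(fused)
```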
CSAM – Channel and Spatial Attention Module
In the multi-layer feature fusion stage of DNIM, CSAM is used for adaptive feature enhancement to reduce the semantic gap. As shown below.
As shown in the figure, CSAM consists of two cascaded attention units: channel attention and spatial attention. Each node L^{i,j} is followed by a one-dimensional channel attention map M_c ∈ R^{C_i×1×1} and a two-dimensional spatial attention map M_s ∈ R^{1×H_i×W_i}.
Channel attention
- The feature map passes through MaxPool and AvgPool respectively, forming two weight vectors of size [C, 1, 1].
- The two vectors pass through the same MLP (a shared network, i.e., an MLP with parameter sharing), which maps each to a per-channel weight.
- The two mapped weight vectors are summed and passed through a Sigmoid.
- The resulting channel weights [C, 1, 1] are multiplied channel-wise with the original feature map [C, H, W].
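The four steps above can be sketched as a small PyTorch module (the `reduction` ratio and the 1×1-conv form of the shared MLP are common assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn


class ChannelAttention(nn.Module):
    """Channel attention unit following the steps above (CBAM-style sketch)."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        # Shared MLP applied to both pooled descriptors (1x1 convs = per-channel FC).
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1, bias=False),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1, bias=False),
        )

    def forward(self, x):
        # Global average and max pooling -> two [B, C, 1, 1] descriptors.
        avg = self.mlp(torch.mean(x, dim=(2, 3), keepdim=True))
        mx = self.mlp(torch.amax(x, dim=(2, 3), keepdim=True))
        # Sum, Sigmoid, then broadcast-multiply over H and W.
        weight = torch.sigmoid(avg + mx)
        return x * weight
```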
Spatial attention
- The feature map passes through MaxPool and AvgPool applied along the channel dimension, forming two [1, H, W] weight maps: the shape changes from [C, H, W] to [1, H, W], with all channels of the same spatial point pooled together.
- The two pooled maps are stacked to form a [2, H, W] spatial descriptor.
- A 7×7 convolutional layer reduces the descriptor from [2, H, W] to [1, H, W]; this [1, H, W] map represents the importance of each point on the feature map, with larger values indicating greater importance.
- The spatial weight [1, H, W] is multiplied with the original feature map [C, H, W], so every point on the [H, W] grid is assigned a weight.
Intuitively, consider a feature map of size [H, W]: at each point (x, y), x ∈ (0, H), y ∈ (0, W), there are C values, and the spatial weight expresses how important that point is within the feature map. Mapped back to the original image through the receptive field, it indicates the importance of the corresponding region. The network thus adaptively focuses on the locations that deserve attention (points with large weights are more likely to be attended to).
Feature Pyramid Fusion Module
The enhanced features are upsampled and concatenated to achieve multi-layer output fusion: the shallow features, rich in spatial and detail information, are combined with the deep features, rich in semantic information, to generate a robust global feature map.
That is, L^{4,0}, L^{3,1}, L^{2,2}, L^{1,3}, and L^{0,4} are upsampled to [C^{i,j}, H^{0,4}, W^{0,4}], then concatenated along the channel dimension, finally producing an output of size [C^{0,4}, H^{0,4}, W^{0,4}].
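The upsample-then-concatenate step can be sketched as a small helper (function name and bilinear interpolation mode are illustrative assumptions):

```python
import torch
import torch.nn.functional as F


def feature_pyramid_fusion(feats):
    """Upsample each enhanced feature map to the finest spatial resolution
    and concatenate along the channel dimension (sketch of the FPFM step).

    `feats` is ordered from deepest (smallest) to shallowest (largest) map,
    e.g. [L^{4,0}, L^{3,1}, ..., L^{0,4}].
    """
    # Target spatial size: that of the shallowest map, e.g. L^{0,4}.
    target = feats[-1].shape[2:]
    ups = [F.interpolate(f, size=target, mode="bilinear", align_corners=False)
           for f in feats]
    return torch.cat(ups, dim=1)
```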
Eight-connected neighbor clustering module
After the feature pyramid fusion module, an eight-connected neighborhood clustering module is introduced to process all pixels and compute the centroid of each target. If the eight-neighborhoods of any two pixels g(m0, n0) and g(m1, n1) in the feature map g intersect (Formula 8) and the two pixels have the same value (0 or 1) (Formula 9), the two pixels are considered to lie in the same connected region. Pixels in a connected region belong to the same target. Once all targets in the image are identified, their centroids are calculated as the target coordinates.
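The clustering step can be sketched with a plain breadth-first search over the eight neighbors of each foreground pixel (a self-contained illustration; the repository may use a different implementation):

```python
from collections import deque

import numpy as np


def eight_connected_centroids(mask):
    """Label 8-connected foreground regions in a binary mask and return the
    centroid (row, col) of each region, in scan order."""
    mask = np.asarray(mask, dtype=bool)
    visited = np.zeros_like(mask, dtype=bool)
    centroids = []
    H, W = mask.shape
    for m in range(H):
        for n in range(W):
            if mask[m, n] and not visited[m, n]:
                # Flood-fill one connected region starting from (m, n).
                queue = deque([(m, n)])
                visited[m, n] = True
                pixels = []
                while queue:
                    y, x = queue.popleft()
                    pixels.append((y, x))
                    # Visit the eight neighbors of the current pixel.
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < H and 0 <= nx < W
                                    and mask[ny, nx] and not visited[ny, nx]):
                                visited[ny, nx] = True
                                queue.append((ny, nx))
                ys, xs = zip(*pixels)
                centroids.append((sum(ys) / len(ys), sum(xs) / len(xs)))
    return centroids
```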
3. Loss calculation
The network is trained with the Soft-IoU loss, consistent with AGPCNet.
The Soft-IoU loss has been explained in the AGPCNet post:
https://blog.csdn.net/weixin_33538887/article/details/126401466
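For reference, a minimal sketch of the common Soft-IoU formulation (the smoothing constant `eps` is an assumption, not taken from the paper):

```python
import torch


def soft_iou_loss(pred, target, eps=1e-6):
    """Soft-IoU loss: 1 minus the soft intersection-over-union between the
    sigmoid prediction and the binary ground-truth mask."""
    prob = torch.sigmoid(pred)
    # Soft intersection and union, summed over channel and spatial dims.
    inter = (prob * target).sum(dim=(1, 2, 3))
    union = (prob + target - prob * target).sum(dim=(1, 2, 3))
    return (1 - (inter + eps) / (union + eps)).mean()
```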
4. Evaluation indicators
In terms of evaluation, two commonly used metrics are adopted: the detection rate P_d and the false-alarm rate F_a.
The detection rate P_d is a target-level metric. It measures the ratio of correctly predicted targets to all targets:

P_d = T_correct / T_All

where T_correct and T_All denote the number of correctly predicted targets and the total number of targets, respectively. A target is considered correctly predicted if its centroid deviation is smaller than the maximum allowed deviation, which is set to 3 pixels in this paper.
The false-alarm rate F_a is another target-level metric. It measures the proportion of falsely predicted pixels among all image pixels:

F_a = P_false / P_All

where P_false and P_All denote the number of falsely predicted pixels and the total number of image pixels, respectively.
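The two definitions can be sketched as a small function (the interface, which takes pre-matched centroid lists and pixel counts, is an illustrative assumption):

```python
def detection_metrics(pred_centroids, gt_centroids,
                      num_false_pixels, num_pixels, max_dist=3.0):
    """Target-level P_d and pixel-level F_a as defined above.

    A ground-truth target counts as detected when some predicted centroid
    lies within `max_dist` pixels of it (the paper's threshold is 3).
    """
    correct = 0
    for gy, gx in gt_centroids:
        if any((gy - py) ** 2 + (gx - px) ** 2 <= max_dist ** 2
               for py, px in pred_centroids):
            correct += 1
    p_d = correct / len(gt_centroids) if gt_centroids else 0.0
    f_a = num_false_pixels / num_pixels
    return p_d, f_a
```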
The ROC curve describes how the detection probability (P_d) changes under different false-alarm rates (F_a).
Plotting the false-alarm rate on the horizontal axis and the detection rate on the vertical axis yields the ROC curve. The more convex the curve, i.e., the larger the area enclosed between the curve and the horizontal axis, the better the detection performance.
5. Paper information
Paper download address: https://arxiv.org/pdf/2106.00487v3.pdf
Paper source code (PyTorch implementation, including the dataset): https://github.com/YeRen123455/Infrared-Small-Target-Detection