Spatial Attention Mechanism: GMA of DPANet

Taking into account the complementarity and inconsistency of cross-modal RGB-D data, directly integrating the cross-modal information may induce negative results, such as contamination from unreliable depth maps.
Besides, the features of a single modality are usually rich in the spatial or channel dimension, but also contain redundant information.
To cope with these issues, we design a GMA module that exploits the attention mechanism to automatically select and strengthen important features for saliency detection, and incorporate a gate controller into the GMA module to prevent contamination from unreliable depth maps.

To reduce the redundancy of single-modal features and highlight the feature response on the salient regions, we apply spatial attention (see ‘S’ in Fig. 3) to the input features rbi and dbi, respectively.
The process can be described as:

fout = fin ⊙ conv2(δ(conv1(fin)))


where fin represents the input feature of the RGB branch or depth branch (i.e., rbi or dbi),
convi (i = 1, 2) refers to a convolution layer,
⊙ denotes element-wise multiplication,
δ is the ReLU activation function,
and fout represents the modified RGB/depth feature (i.e., r̃bi or d̃bi).
The channels of the modified features r̃bi and d̃bi are unified to 256 dimensions at each stage.
Note that the weights are not shared between the RGB and depth branches in our model.
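The single-modality spatial attention above can be sketched in NumPy. This is a minimal illustration, not DPANet's implementation: the 1 × 1 convolutions are expressed as channel-wise matrix multiplies, and the weight shapes (`w1`, `w2`) and the single-channel output of the second convolution are assumptions:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def spatial_attention(f_in, w1, w2):
    """Sketch of fout = fin ⊙ conv2(ReLU(conv1(fin))).

    f_in: input feature map, shape (C, H, W)
    w1:   1x1 conv weights, shape (C_mid, C)
    w2:   1x1 conv weights, shape (1, C_mid) -> single-channel spatial map
    """
    C, H, W = f_in.shape
    x = f_in.reshape(C, H * W)          # flatten spatial dims: (C, HW)
    attn = w2 @ relu(w1 @ x)            # (1, HW) spatial weight map
    return (x * attn).reshape(C, H, W)  # broadcast re-weighting over channels

rng = np.random.default_rng(0)
f = rng.standard_normal((256, 8, 8))       # one stage's 256-channel feature
w1 = rng.standard_normal((32, 256)) * 0.1
w2 = rng.standard_normal((1, 32)) * 0.1
out = spatial_attention(f, w1, w2)
print(out.shape)  # (256, 8, 8)
```

Because the weights are not shared between branches, the RGB and depth paths would each hold their own `w1`, `w2`.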

Further, inspired by the success of self-attention [54], [55], we design two symmetrical attention sub-modules to capture long-range dependencies from a cross-modal perspective. Taking Adr in Fig. 3 as an example, Adr exploits the depth information to generate a spatial weight for the RGB feature r̃bi, as depth cues can usually provide helpful information (e.g., the coarse location of salient objects) for the RGB branch.

Technically, we first apply a 1 × 1 convolution to project d̃bi into Wq ∈ R^(C1×(HW)) and Wk ∈ R^(C1×(HW)), and to project r̃bi into Wv ∈ R^(C×(HW)), where C, H, W refer to the channel, height, and width of the feature Wv, respectively, and C1 is set to 1/8 of C for computational efficiency. We compute the enhanced feature as follows:

fout = Wv · softmax(Wq⊤ Wk), reshaped back to C × H × W.
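As a hedged NumPy sketch (standard query/key/value attention, not necessarily the paper's exact formulation), the depth-guided sub-module Adr projects the query and key from the depth feature and the value from the RGB feature; all weight names and shapes below are assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(rgb, depth, wq, wk, wv):
    """Adr sketch: depth-guided attention over the RGB feature.

    rgb, depth: features of shape (C, H, W)
    wq, wk: 1x1 conv weights (C1, C), projecting the depth feature
    wv:     1x1 conv weights (C, C),  projecting the RGB feature
    """
    C, H, W = rgb.shape
    d = depth.reshape(C, H * W)
    r = rgb.reshape(C, H * W)
    Wq = wq @ d                        # (C1, HW) query from depth
    Wk = wk @ d                        # (C1, HW) key from depth
    Wv = wv @ r                        # (C,  HW) value from RGB
    sim = softmax(Wq.T @ Wk, axis=-1)  # (HW, HW) long-range affinities
    out = Wv @ sim.T                   # aggregate RGB values by depth affinity
    return out.reshape(C, H, W)

rng = np.random.default_rng(1)
C, C1, H, W = 64, 8, 6, 6              # C1 = C/8, as in the text
r = rng.standard_normal((C, H, W))
d = rng.standard_normal((C, H, W))
out = cross_modal_attention(r, d,
                            rng.standard_normal((C1, C)) * 0.1,
                            rng.standard_normal((C1, C)) * 0.1,
                            rng.standard_normal((C, C)) * 0.1)
print(out.shape)  # (64, 6, 6)
```

The symmetrical sub-module Ard would swap the roles of the two modalities, projecting the query/key from r̃bi and the value from d̃bi.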


Origin blog.csdn.net/zjc910997316/article/details/118768758