RIS Series: Semantics-Aware Dynamic Localization and Refinement for Referring Image Segmentation (Paper Reading Notes)


Preface

  It’s another weekend, and by now most of you have probably started the new semester. Keep working on your theses and graduate early~

1. Abstract

  Referring image segmentation (RIS) aims to produce a high-quality mask of the object described by a natural-language expression. Existing methods often resort to iterative learning, relying on RNNs or stacked attention layers to refine visual-linguistic features. However, RNN-based methods are tied to specific encoders, and stacking attention layers brings only marginal gains. This paper therefore introduces a method that learns multi-modal features progressively. The core idea is to use a continuously updated query as the representation of the target and, in each iteration, to strengthen the multi-modal features related to the query while weakening the irrelevant ones, so that the emphasis gradually shifts from localization to segmentation. The proposed method can be plugged directly into several SOTA frameworks, and experiments show that it achieves SOTA results.

*(figure omitted)*

2. Introduction

  The introduction first reviews the definition and applications of referring image segmentation (RIS) and how it differs from traditional semantic/instance segmentation.
  Some existing methods build on fully convolutional architectures and design cross-modal feature fusion modules to align linguistic and visual cues in a common space. However, the language and visual inputs may contain a large amount of redundant noise, which motivated iterative learning methods: over a number of iterations, cross-modal feature alignment and refinement are gradually established to reach higher segmentation accuracy.
  Existing SOTA methods usually perform iterative learning through recurrent neural networks (RNNs) or stacked attention layers. Transformer-based models can iteratively integrate multi-scale visual-linguistic features, but they do not directly address cross-modal alignment. Some methods introduce cascaded group attention to refine cross-modal features over the entire image, yet they lack a prior that focuses computation on the regions that actually matter, in particular local regions when the target is occluded.
  This paper therefore proposes semantics-aware dynamic localization and refinement (SADLR), in which a semantics-aware dynamic convolution module conditions its kernels on a continuously updated target representation (the query), gradually strengthening target-related information and suppressing background noise. To bootstrap localization, the query is first initialized with the linguistic feature vector of the input expression, and a convolution kernel predicted from it is applied to the multi-modal feature map. In each subsequent step, the query is updated with the target context pooled in the previous step, and dynamic convolution is performed again.
  Experiments show that the method in this article works well when applied to three SOTA methods (LTS, VLT, LAVT).

3. Related work

Referring image segmentation

  Previous CNN-based methods focus on cross-modal feature fusion, and some recent methods use Transformer structures to aggregate visual-linguistic representations. To ease the difficulty of cross-modal context modeling, several methods adopt iterative learning and cross-modal feature refinement; these methods mainly capture language-related multi-modal information from visual features. The work in this paper is most closely related to CGAN, which uses cascaded multi-head attention layers to iteratively refine the cross-modal map of the entire image. The difference is that CGAN does not leverage its predictions as priors to focus refinement on the more important regions. Experiments show that the cascaded attention model is mainly effective for architectures built on traditional (non-Transformer) models. The figure below compares several of these designs:

*(figure omitted)*
  The method proposed in this paper has two advantages over the methods above: it does not depend on the choice of a particular language encoder or multi-modal feature encoding scheme, and it generalizes to a wide range of SOTA models.

Dynamic convolution

  Dynamic convolution is widely used. It was first proposed for short-range weather prediction and has recently become popular in query-based object detection methods. In these models, a set of learnable embeddings serves as latent target features, and dynamic weights conditioned on these queries are used to localize and classify targets. In contrast, this paper applies dynamic convolution to language-guided segmentation.
  In methods for video object segmentation, dynamic convolution/filtering usually predicts classification weights with a matching number of channels that act directly on the feature maps. In contrast, this paper aims to generate a set of informative feature maps that highlight the semantics related to the dynamic query.

Multi-modal reasoning

  Work on multi-modal reasoning includes large-scale pre-training, general-purpose Transformer models, and MDETR.

4. Method

4.1 Overview

  
*(Figure 3 omitted)*

Preliminaries

  As shown in Figure 3, the input consists of an image and a referring expression. First, a language encoder such as BERT or an RNN extracts a set of language features $L \in \mathbb{R}^{C_l \times N}$ from the input expression, where $C_l$ is the number of channels and $N$ is the number of words. The input image and the language features $L$ are then fed into a multi-modal feature encoding network, which maps linguistic and visual features into a single feature space so that aligned information is captured. Denote the resulting multi-modal feature map as $Y \in \mathbb{R}^{C \times H \times W}$, where $C$ is the number of channels and $H$ and $W$ are its height and width, respectively.

Semantic-aware dynamic convolution

  In the subsequent processing, the multi-modal features are refined iteratively by a semantics-aware dynamic convolution module. Each layer of this module predicts a convolution kernel from an input feature vector and then convolves the multi-modal feature map with it. The feature vector from which the kernel is generated is called the "query".

Per-iteration pipeline

  Suppose the refinement process consists of $n$ iterations. In each iteration $i \in \{1, 2, \ldots, n\}$, the dynamic query is denoted $Q_i \in \mathbb{R}^{C}$. The query $Q_i$ and the multi-modal feature map $Y$ (from the encoding stage) are fed into the semantics-aware dynamic convolution module, whose kernels are predicted from $Q_i$ and convolved with $Y$. The output of the module is a set of feature maps related to $Q_i$, denoted $Z_i \in \mathbb{R}^{C \times H \times W}$. A $1 \times 1$ convolution then projects $Z_i$ onto a coarse score map $R_i \in \mathbb{R}^{2 \times H \times W}$, where channel $0$ represents the background and channel $1$ represents the target category. Applying argmax over $R_i$ yields the binarized target mask $M_i \in \mathbb{R}^{H \times W}$.
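
To make the shapes concrete, here is a minimal PyTorch sketch of the data flow in one iteration. The stand-in `dynamic_conv`, the 30×30 spatial size, and the variable names are illustrative assumptions, not the paper's implementation; the actual module is described in Section 4.3 below.

```python
import torch
import torch.nn as nn

C, H, W = 512, 30, 30                       # channel / spatial sizes (illustrative only)

def dynamic_conv(query, feat):
    """Stand-in for the semantics-aware dynamic convolution module (Sec. 4.3):
    any mapping (Q_i, Y) -> Z_i that keeps the spatial size of Y."""
    return feat * query.view(-1, 1, 1)      # placeholder, NOT the real module

score_head = nn.Conv2d(C, 2, kernel_size=1)  # 1x1 conv projecting Z_i to R_i

Q_i = torch.randn(C)                        # current dynamic query
Y = torch.randn(C, H, W)                    # multi-modal feature map from the encoder

Z_i = dynamic_conv(Q_i, Y)                            # query-related feature maps, (C, H, W)
R_i = score_head(Z_i.unsqueeze(0)).squeeze(0)         # coarse score map, (2, H, W)
M_i = R_i.argmax(dim=0)                               # binary mask, (H, W): 0 = background, 1 = target
```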

Query initialization and iterative update

  Different types of queries are used in different iterations to emphasize different kinds of features. In the first iteration, $Q_1$ is initialized with the sentence feature vector $S$, which is obtained from the language encoder output $L$ by average pooling over the words followed by a linear projection along the channel dimension. Since $Q_1$ contains purely linguistic information, the output $Z_1$ of the semantics-aware dynamic convolution module highlights the information needed to localize the referent. In the $i$-th iteration, the new query $Q_i$ is obtained with the help of $M_{i-1}$: $Q_{i-1}$ is updated with the target feature vector $O_{i-1}$ pooled from $Y$. This process is as follows:
$$
\begin{aligned}
O_{i-1} &= \mathrm{AvgPool}(M_{i-1}, Y) \\
Q_{i} &= Q_{i-1} + O_{i-1}
\end{aligned}
$$

where $\mathrm{AvgPool}$ computes a weighted average of the feature vectors of $Y$ over the spatial dimensions, with the binary mask $M_{i-1}$ serving as the weight map. This query update scheme aims to gradually integrate target context into the query, so that the mask $M_{i-1}$ becomes more accurate with each iteration.
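
Below is a minimal sketch of this masked average pooling and query update; the helper name `masked_avg_pool`, the `eps` smoothing term, and the tensor sizes are assumptions for illustration.

```python
import torch

def masked_avg_pool(mask, feat, eps=1e-6):
    """Weighted spatial average of the feature vectors in Y, with the binary
    mask M_{i-1} as the weight map (the AvgPool in the equation above)."""
    weights = mask.float()                              # (H, W), values in {0, 1}
    pooled = (feat * weights).sum(dim=(1, 2)) / (weights.sum() + eps)
    return pooled                                       # (C,)

C, H, W = 512, 30, 30
Y = torch.randn(C, H, W)
M_prev = torch.randint(0, 2, (H, W))                    # M_{i-1} from the previous iteration
Q_prev = torch.randn(C)                                 # Q_{i-1}

O_prev = masked_avg_pool(M_prev, Y)                     # pooled target context O_{i-1}
Q_i = Q_prev + O_prev                                   # updated query Q_i
```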

4.2 Multi-modal feature encoding

  By default, LAVT is used as the multi-modal feature encoding network; it uses a hierarchical vision Transformer to jointly embed visual and linguistic information. At each stage of the visual backbone, a pixel-word attention module aligns linguistic features with the visual features at every spatial location, and a language pathway feeds the multi-modal cues directly into the next stage of the backbone. Nonetheless, the proposed approach is not tied to a specific multi-modal feature encoding network.

4.3 Semantic-aware dynamic convolution

*(Figure 4 omitted)*
  As shown in Figure 4, the semantics-aware dynamic convolution module consists of two layers of dynamic convolution. At each SADLR iteration, given the query $Q_i$, a linear projection layer first generates the dynamic kernel $K_i \in \mathbb{R}^{C \times C'}$, where $C$ and $C'$ are the numbers of input and output channels, respectively. The input feature map $Y$ is multiplied by the dynamic kernel $K_i$, followed by layer normalization and a ReLU activation. A second kernel $K_i' \in \mathbb{R}^{C' \times C}$ is produced from $Q_i$ by another linear projection. Similarly, the output of the first layer is matrix-multiplied with $K_i'$, again followed by layer normalization and a ReLU activation.
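
The following PyTorch sketch shows one way to realize the two dynamic layers described above, implementing each $1 \times 1$ dynamic convolution as a matrix product between the flattened feature map and a kernel predicted from the query. The hidden width `c_mid`, the placement of the normalization, and the class name are assumptions, not the official implementation.

```python
import torch
import torch.nn as nn

class SemanticsAwareDynamicConv(nn.Module):
    def __init__(self, c=512, c_mid=64):
        super().__init__()
        self.c, self.c_mid = c, c_mid
        self.proj1 = nn.Linear(c, c * c_mid)   # predicts K_i  in R^{C x C'}
        self.proj2 = nn.Linear(c, c_mid * c)   # predicts K_i' in R^{C' x C}
        self.norm1 = nn.LayerNorm(c_mid)
        self.norm2 = nn.LayerNorm(c)

    def forward(self, query, feat):
        # query: (C,)  feat: (C, H, W)
        c, h, w = feat.shape
        x = feat.flatten(1).t()                              # (H*W, C)
        k1 = self.proj1(query).view(self.c, self.c_mid)      # dynamic kernel K_i
        x = torch.relu(self.norm1(x @ k1))                   # (H*W, C')
        k2 = self.proj2(query).view(self.c_mid, self.c)      # dynamic kernel K_i'
        x = torch.relu(self.norm2(x @ k2))                   # (H*W, C)
        return x.t().reshape(c, h, w)                        # Z_i: (C, H, W)

module = SemanticsAwareDynamicConv()
Z_i = module(torch.randn(512), torch.randn(512, 30, 30))    # (512, 30, 30)
```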

4.4 Prediction Masks and Loss Functions

  The coarse score map $R_i$ is upsampled to the resolution of the input image, and argmax along the channel dimension gives the predicted mask. The loss function is
$$
L = \lambda_{1} L_{1} + \lambda_{2} L_{2} + \cdots + \lambda_{n} L_{n}
$$
where $L_i$ denotes the loss of the $i$-th iteration and $\lambda_i$ is its balancing weight. Each $L_i$ is the Dice loss averaged over the target and background categories. At inference time, the mask from the last iteration is used as the prediction.
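
A sketch of the per-iteration Dice loss and the weighted sum is shown below; the smoothing constant, the softmax over the two score channels, and the bilinear upsampling are implementation assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def dice_loss(score_map, gt_mask, eps=1.0):
    """Dice loss averaged over the background and target categories.
    score_map: R_i with shape (2, h, w); gt_mask: (H, W) with values in {0, 1}."""
    H, W = gt_mask.shape
    probs = F.interpolate(score_map.unsqueeze(0), size=(H, W), mode='bilinear',
                          align_corners=False).softmax(dim=1).squeeze(0)        # (2, H, W)
    onehot = F.one_hot(gt_mask.long(), num_classes=2).permute(2, 0, 1).float()  # (2, H, W)
    inter = (probs * onehot).sum(dim=(1, 2))
    denom = probs.sum(dim=(1, 2)) + onehot.sum(dim=(1, 2))
    return (1.0 - (2.0 * inter + eps) / (denom + eps)).mean()  # mean over the two classes

# Weighted sum over the n = 3 iterations, with the weights from Sec. 5.2.
lambdas = [0.15, 0.15, 0.7]
R_list = [torch.randn(2, 30, 30) for _ in range(3)]   # coarse score maps R_1..R_3 (dummy)
gt = torch.randint(0, 2, (480, 480))                  # ground-truth mask (dummy)
loss = sum(lam * dice_loss(R, gt) for lam, R in zip(lambdas, R_list))
```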

5. Experiment

5.1 Datasets and evaluation metrics

Datasets

  RefCOCO, RefCOCO+, and G-Ref.

Evaluation metrics

  Precision@K (P@K), where K is the IoU threshold; mean IoU (mIoU); and overall IoU (oIoU).
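
For reference, below is a minimal sketch of how these three metrics are typically computed over a test set; the threshold set {0.5, ..., 0.9} is the conventional choice and is assumed here.

```python
import numpy as np

def ris_metrics(pred_masks, gt_masks, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """pred_masks, gt_masks: lists of binary (H, W) arrays, one pair per test image."""
    ious, inter_total, union_total = [], 0, 0
    for p, g in zip(pred_masks, gt_masks):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union > 0 else 0.0)
        inter_total += inter
        union_total += union
    miou = float(np.mean(ious))                # mean IoU: average of per-image IoUs
    oiou = float(inter_total / union_total)    # overall IoU: dataset-level intersection / union
    p_at_k = {k: float(np.mean([iou >= k for iou in ious])) for k in thresholds}
    return miou, oiou, p_at_k
```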

5.2 Implementation details

  Most training and inference settings follow LAVT. BERT-base is the language encoder and Swin-B is the visual backbone. The BERT weights come from HuggingFace, the Swin weights are initialized from ImageNet-22K pre-training, and the remaining parameters are randomly initialized. The model is trained end to end with the AdamW optimizer, an initial learning rate of 5e-5, weight decay of 0.01, and a "poly" learning-rate schedule. The channel numbers are $C_l = 768$ and $C = 512$. The default number of iterations is $n = 3$, so the weight hyperparameters $\lambda_1$, $\lambda_2$, $\lambda_3$ are set to 0.15, 0.15, and 0.7, respectively (the weights sum to 1, the largest weight ($\ge 0.5$) is assigned to the final iteration, and the rest is split evenly). Training runs for 40 epochs with batch size 32. The input image size is $480 \times 480$ and the input sentence length is 20.
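
A minimal sketch of the optimizer and the "poly" schedule follows; the decay exponent 0.9 and the number of steps per epoch are assumptions (the notes only name the schedule), and `model` is a stand-in for the SADLR network.

```python
import torch

model = torch.nn.Conv2d(512, 2, kernel_size=1)   # stand-in for the SADLR network
steps_per_epoch, epochs = 1000, 40               # steps_per_epoch is illustrative
total_steps = steps_per_epoch * epochs

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5, weight_decay=0.01)

# "Poly" schedule: scale the learning rate by (1 - t / T) ** 0.9 at step t.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=lambda step: (1.0 - step / total_steps) ** 0.9)

for step in range(total_steps):
    # ... forward pass over a batch, compute the weighted Dice loss, loss.backward() ...
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```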

5.3 Comparison with other methods

*(figures omitted)*

5.4 Ablation experiment

Comparison with CGAN and baseline methods

*(figure omitted)*

number of iterations

*(figure omitted)*

Structure of semantic-aware dynamic convolution module

  See Table 4(b) above.

Query update method

  See Table 4(c) above.

6. Conclusion

  This paper proposes a semantic-aware dynamic localization and refinement method for RIS, which iteratively refines multi-modal feature maps based on aggregated target context. Experiments show that the method has good generalization and effect.

Appendix

*(figures omitted)*

Closing thoughts

  This paper proposes a dynamic convolution structure and an iterative refinement scheme; it is worth reading and digesting. The description of the parameters is repeated in the paper, which does not look like a mistake a human would make; combined with the acknowledgments, it would not be surprising if parts of it were written by AI. Also note that Figures 1 and 2 are never cited in the paper.

Origin: blog.csdn.net/qq_38929105/article/details/132645800