RES Series GRES: Generalized Referring Expression Segmentation Paper Reading Notes


A few words up front

  This week is almost over and I haven't gotten much done. After getting back to school I reinstalled Ubuntu, since the previous install had crashed; I now do backups from the command line.
  This is an article about a new RES dataset. Let's see whether there is an opportunity to publish something based on it.

1. Abstract

  First, the paper recalls the definition of Referring Expression Segmentation (RES). Current classic RES datasets and methods generally support only single-target expressions, i.e., one expression corresponds to one target, and do not consider multi-target or no-target expressions. This paper therefore proposes Generalized Referring Expression Segmentation (GRES), which allows an expression to refer to any number of targets. It also constructs the first large-scale GRES dataset, gRefCOCO, containing multi-target, no-target, and single-target expressions. Experiments show that the main difficulty of GRES lies in complex relationship modeling. Based on this, a region-based ReLA model is proposed, which adaptively divides the image into regions containing instances and explicitly models region-region and region-language dependencies. The proposed ReLA performs well on both GRES and classic RES datasets.

2. Introduction

  
  Definition, applications, and datasets of Referring Expression Segmentation (RES).

  • Limitations of Classic RES
    First, classic RES does not consider expressions without targets, which means existing RES methods cannot handle image-text pairs whose target is not in the input image. Second, most existing RES datasets, such as RefCOCO, contain no expressions that refer to multiple targets. As shown below:

(figure omitted)
  Experiments show that traditional RES methods fail to handle these multi-target or no-target scenarios well.

  • New benchmark and dataset
    This article proposes a new benchmark: Generalized Referring Expression Segmentation (GRES), which allows an expression to refer to any number of targets. GRES still takes an image and a referring expression as input, but additionally supports a single expression referring to multiple targets, as well as expressions whose target does not appear in the image. In contrast, existing referring expression datasets contain only single-target expressions. The new dataset built for GRES is called gRefCOCO; it extends RefCOCO with two kinds of samples: multi-target expressions and no-target expressions.
  • New baseline method
    Region-region interaction is important in RES. However, classic RES only needs to detect one target, so most methods achieve good performance without modeling region-region interaction. GRES, in contrast, relies much more on long-range region-region dependency modeling. A network is therefore designed that divides the image into several regions, which then explicitly interact with each other. In addition, the proposed network softly collects the features of each region for greater flexibility. Extensive experiments show that explicitly modeling flexible interactions between region features benefits GRES performance.

  The contributions of this article are summarized as follows:

  • A new benchmark, Generalized Referring Expression Segmentation (GRES), is proposed, making RES more flexible and easier to deploy in real scenarios.
  • The first large-scale GRES dataset, gRefCOCO, is built, supporting expressions with any number of targets.
  • A baseline method, ReLA, is proposed to model the complex relationships between targets; it achieves SOTA on both the classic RES and GRES tasks.
  • Extensive experiments analyze the possible reasons for the performance gap between RES and GRES and the new challenges GRES brings.

3. Related work

Related referring tasks and datasets

  Referring Expression Comprehension (REC) outputs bounding boxes. Early datasets for RES and REC include ReferIt, but each expression refers to only a single instance. RefCOCO and RefCOCOg were proposed later; again, each expression corresponds to one instance.
  Recently, some new datasets have been proposed, but they do not focus on, or are not suitable for, GRES. PhraseCut, for example, contains multi-target expressions, but only as a fallback, i.e., only when a single target cannot serve as an unambiguous referent. Moreover, its expressions are composed from templates rather than written in natural language. Image captioning is also close to RES, but captions are not guaranteed to be unambiguous and are therefore unsuitable for referring tasks. Some referring datasets use other modalities or learning schemes, e.g., ScanRefer focuses on 3D objects and ClevrTex on unsupervised learning. Furthermore, none of these datasets contain no-target expressions.

Referring segmentation methods

  There are roughly two types: single-stage (top-down) and two-stage (bottom-up). Single-stage methods typically use an end-to-end FCN-like network and make predictions by classifying multi-modal features pixel by pixel. Two-stage methods first find a set of instance proposals and then select the target instance among them. Most RES methods are single-stage, while many REC methods are two-stage. Recent Transformer-based methods significantly outperform CNN-based ones. Zero-shot segmentation methods use category names as the text input and focus on recognizing novel categories, whereas RES uses natural language expressions to identify the target the user refers to.

4. Task settings and data sets

4.1 GRES settings

RES Review

  Current RES does not consider no-target expressions, so existing models may output wrong instance results when the input image contains multiple referents or no referent at all. As shown below:

(figure omitted)

Generalized RES

  In order to address the limitations of classic RES, this article proposes a new benchmark: Generalized Referring Expression Segmentation (GRES), which allows an expression to refer to any number of target objects. A GRES data sample contains four elements: an image $I$, a language expression $T$, a GT mask $M_{GT}$ containing all pixels of the targets that $T$ refers to, and a binary no-target label $E_{GT}$ indicating whether $T$ is a no-target expression. There is no limit on the number of instances referred to by $T$. A GRES model takes $I$ and $T$ as input and predicts a mask $M$; for no-target expressions, $M$ should be empty.
  Multi-target expressions enable user-specified open-vocabulary segmentation, and no-target expressions let the model tell whether the referred target exists in the image at all, so GRES is more practical.
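  To make the task format concrete, here is a minimal sketch of what a GRES sample could look like in code. The field names and the helper function are my own illustration, not from the paper.

```python
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class GRESSample:
    """One GRES data sample (field names are illustrative, not the paper's)."""
    image: np.ndarray       # I: H x W x 3 image
    expression: str         # T: natural language expression, any number of targets
    gt_mask: np.ndarray     # M_GT: H x W binary mask covering ALL referred pixels
    no_target: bool         # E_GT: True if T refers to nothing in the image

def is_prediction_consistent(sample: GRESSample, pred_mask: Optional[np.ndarray]) -> bool:
    """A no-target expression should yield an empty (or missing) predicted mask."""
    if sample.no_target:
        return pred_mask is None or pred_mask.sum() == 0
    return pred_mask is not None and pred_mask.sum() > 0
```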

Evaluation

  In gRefCOCO, models are not forced to distinguish individual instances. In addition to the common RES metrics, cumulative IoU (cIoU) and Precision@X, a new metric is proposed: generalized IoU (gIoU), which extends the mean IoU to all samples, including no-target ones. In addition, performance on no-target samples is evaluated separately with No-target accuracy (N-acc.) and Target accuracy (T-acc.).

4.2 gRefCOCO: a large-scale GRES dataset

  gRefCOCO contains 278,232 expressions, including 80,022 multi-target and 32,202 no-target expressions, referring to 60,287 instances in 19,994 images. Bounding boxes and masks of all target instances are provided. Some single-target expressions are inherited from RefCOCO. An online annotation tool was developed to find images, select instances, write expressions, and verify results.
  The basic annotation process follows ReferIt to ensure annotation quality. The dataset splits keep the same proportions as the UNC partition of RefCOCO.

(figure omitted)

Multi-target samples

  Users usually select multiple targets based on logical relationships or similarities rather than assembling them at random. The annotator then writes an unambiguous referring expression that selects these instances. Such expressions have the following four properties and challenges.

Use of counting expressions

  The expression in Figure 3(a) contains an ordinal word, so the model must distinguish the number of referred targets from the ordinal numbers in the expression. Explicit or implicit target counting ability can be used to handle such expressions.

Compound word structure without set relationship

  As shown in Figure 3, "A and B", "A except B", "A with B or C", this puts higher requirements on the model to understand the long-distance dependence of images and sentences.

Attribute domain

  When an expression contains multiple targets, the different targets may share the same attributes or have different ones. This requires the model to understand all attributes and map each attribute to the corresponding target.

More complex relationships

  As shown in Figure 3(b), two similar expressions applied to the same image refer to two different target sets. In GRES, relationships therefore not only describe targets but also imply the number of targets, which requires the model to understand all interactions between instances.

No-target samples

  If no constraints are placed on no-target expressions, annotators tend to write many overly simple or generic expressions that differ greatly from valid ones. Two rules are therefore set: the expression must not be unrelated to the image; and the annotator may pick a misleading expression from other RefCOCO images that share the same data distribution.

5. Proposed method

(figure omitted)

5.1 Architecture overview

  The model structure is shown in the figure above. The input image is encoded by a Swin Transformer to extract visual features $F_i \in \mathbb{R}^{H\times W\times C}$, where $H\times W$ is the spatial size and $C$ is the channel dimension. The input language expression is processed by BERT to obtain language features $F_t \in \mathbb{R}^{N_t\times C}$, where $N_t$ is the number of words in the expression. Next, $F_i$ is fed into a pixel decoder to obtain mask features $F_m$. At the same time, $F_i$ and $F_t$ are sent to the proposed ReLAtionship modeling module, which further splits the feature map into $P\times P = P^2$ regions and models the relationships among them. The shape and size of each spatial region are not predetermined but assigned dynamically by ReLA. ReLA produces two kinds of outputs: region features $F_r=\{f_r^n\}_{n=1}^{P^2}$ and region filters $F_f=\{f_f^n\}_{n=1}^{P^2}$. For the $n$-th region, its region feature $f_r^n$ is used to predict a scalar $x_r^n$ indicating the probability that the region contains a target, and its region filter $f_f^n$ is dot-multiplied with the mask features $F_m$ to obtain a region segmentation mask $M_r^n\in \mathbb{R}^{H\times W}$ indicating the area covered by this region. These masks are then aggregated into the predicted mask by weighting:
$$M=\sum_n x_r^n M_r^n$$
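  This aggregation is just a weighted sum over the region masks; a one-function PyTorch sketch (tensor names are assumptions of mine):

```python
import torch

def aggregate_region_masks(x_r: torch.Tensor, M_r: torch.Tensor) -> torch.Tensor:
    """
    x_r: (P*P,)        probability that each region contains a target
    M_r: (P*P, H, W)   per-region segmentation masks
    returns M: (H, W)  final mask, M = sum_n x_r[n] * M_r[n]
    """
    return torch.einsum('n,nhw->hw', x_r, M_r)
```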

Output and loss

  The predicted mask $M$ is supervised by the GT mask $M_{GT}$. The $P\times P$ probability map $x_r$ is supervised by a minimap obtained by downsampling $M_{GT}$. At the same time, a no-target label $E$ is predicted from the globally averaged region features $F_r$. During inference, if $E$ is predicted positive, the output mask $M$ is set to empty. $M$, $x_r$, and $E$ are all trained with cross-entropy losses.
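  A rough sketch of the three supervision terms as described above (mask, region probability minimap, no-target label). The downsampling choice, the thresholding, and the equal loss weights are my assumptions, not details from the paper:

```python
import torch
import torch.nn.functional as F

def rela_losses(M, M_gt, x_r, E_logit, E_gt, P=10):
    """
    M:       (H, W)   predicted mask logits
    M_gt:    (H, W)   ground-truth binary mask (float)
    x_r:     (P*P,)   per-region target probabilities (after sigmoid)
    E_logit: ()       no-target prediction logit
    E_gt:    ()       1.0 if the expression refers to nothing, else 0.0
    """
    # mask supervised by the GT mask
    loss_mask = F.binary_cross_entropy_with_logits(M, M_gt)
    # region probability map supervised by the minimap (downsampled GT mask)
    minimap = F.adaptive_avg_pool2d(M_gt[None, None], (P, P)).flatten()
    loss_region = F.binary_cross_entropy(x_r, (minimap > 0).float())
    # no-target label supervised by E_gt
    loss_nt = F.binary_cross_entropy_with_logits(E_logit, E_gt)
    return loss_mask + loss_region + loss_nt  # loss weights assumed equal
```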

5.2 ReLAtionship modeling

  The proposed ReLAtionship modeling contains two modules: Region-Image Cross Attention (RIA) and Region-Language Cross Attention (RLA). RIA elastically collects region-wise image features, and RLA captures region-region and region-language dependencies.

(figure omitted)

Region-Image Cross Attention (RIA)

  RIA takes the visual features $F_i$ and $P^2$ learnable region-based queries $Q_r$ as input, and is supervised by the minimap shown in Figure 4. Each query corresponds to a spatial region of the image and is responsible for decoupling the features of that region. Its structure is shown in Figure 5(a). First, attention between the image features $F_i$ and the $P^2$ query embeddings $Q_r\in\mathbb{R}^{P^2\times C}$ generates $P^2$ attention maps:
$$A_{ri}=\operatorname{softmax}\big(Q_r\,\sigma(F_iW_{ik})^T\big)$$
where $W_{ik}$ is a $C\times C$ learnable parameter matrix and $\sigma$ is GELU. The output $A_{ri}\in\mathbb{R}^{P^2\times HW}$ gives each query an $H\times W$ attention map indicating its corresponding area in the image. These attention maps are then used to gather region features from the corresponding areas:
$$F_r'=A_{ri}\,\sigma(F_iW_{iv})$$
where $W_{iv}$ is a $C\times C$ learnable parameter matrix. In this way, the features of each region are collected dynamically rather than from predetermined locations. A region filter $F_f$, which carries region clues for mask prediction, is predicted from $F_r'$. $F_r'$ is further fed into RLA for region-region and region-word interaction modeling.
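  As a sanity check on the shapes, here is a simplified single-head PyTorch sketch of RIA following the two equations above; flattening $F_i$ to (H*W, C), the bias-free linear layers, and the missing batch dimension are my simplifications:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionImageCrossAttention(nn.Module):
    """Sketch of RIA: P^2 learnable region queries attend over image features."""
    def __init__(self, dim: int, num_regions: int):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_regions, dim))  # Q_r
        self.W_ik = nn.Linear(dim, dim, bias=False)                 # key projection
        self.W_iv = nn.Linear(dim, dim, bias=False)                 # value projection

    def forward(self, F_i: torch.Tensor):
        """F_i: (H*W, C) flattened image features."""
        k = F.gelu(self.W_ik(F_i))                         # (H*W, C)
        v = F.gelu(self.W_iv(F_i))                         # (H*W, C)
        A_ri = torch.softmax(self.queries @ k.T, dim=-1)   # (P^2, H*W) attention maps
        F_r_prime = A_ri @ v                               # (P^2, C) region features
        return F_r_prime, A_ri
```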

Region-Language Cross Attention (RLA)

  The region image features $F_r'$ are derived from the image features alone and contain neither region-region nor region-language relationships, so the RLA module is proposed to model region-region and region-language interactions. As shown in Figure 5(b), RLA consists of a self-attention branch and a multi-modal cross-attention branch. The self-attention module models region-region dependencies: by computing the attention between each region and all other regions, it outputs relation-aware region features $F_{r1}$. Meanwhile, the cross attention takes the language features $F_t$ as key and value and the region image features $F_r'$ as query, so that the relationship between each word and each region is modeled:
$$A_l=\operatorname{softmax}\big(\sigma(F_r'W_{lq})\,\sigma(F_tW_{lk})^T\big)$$
where $A_l\in\mathbb{R}^{P^2\times N_t}$. The word-region attention then produces language-aware region features $F_{r2}=A_lF_t$. Finally, an MLP aggregates the relation-aware region features $F_{r1}$, the language-aware region features $F_{r2}$, and the region image features $F_r'$:
$$F_r=\text{MLP}(F_r'+F_{r1}+F_{r2})$$
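  A matching sketch of RLA under the same simplifications (single head, no batch dimension); using nn.MultiheadAttention for the self-attention branch is my choice, not necessarily the paper's implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionLanguageCrossAttention(nn.Module):
    """Sketch of RLA: models region-region and region-word dependencies."""
    def __init__(self, dim: int):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.W_lq = nn.Linear(dim, dim, bias=False)
        self.W_lk = nn.Linear(dim, dim, bias=False)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, F_r_prime: torch.Tensor, F_t: torch.Tensor):
        """F_r_prime: (P^2, C) region features, F_t: (N_t, C) word features."""
        # region-region self-attention -> relation-aware region features F_r1
        x = F_r_prime.unsqueeze(0)                    # (1, P^2, C)
        F_r1 = self.self_attn(x, x, x)[0].squeeze(0)  # (P^2, C)
        # region-language cross-attention -> language-aware region features F_r2
        q = F.gelu(self.W_lq(F_r_prime))              # (P^2, C)
        k = F.gelu(self.W_lk(F_t))                    # (N_t, C)
        A_l = torch.softmax(q @ k.T, dim=-1)          # (P^2, N_t) word-region attention
        F_r2 = A_l @ F_t                              # (P^2, C)
        # fuse the three feature sets
        return self.mlp(F_r_prime + F_r1 + F_r2)      # (P^2, C)
```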

6. Experiment

6.1 Evaluation indicators

  In addition to the widely used RES metrics, cumulative IoU (cIoU) and Precision@X (Pr@X), three metrics are introduced for GRES: No-target accuracy (N-acc.), Target accuracy (T-acc.), and generalized IoU (gIoU).

cIoU and Pr@X.

  cIoU is the ratio of the total intersection pixels to the total union pixels accumulated over all samples. Pr@X is the percentage of samples whose IoU exceeds a threshold $X$; no-target samples are not counted in Pr@X. Since multi-target samples have larger foreground areas, they obtain high cIoU scores more easily, so the thresholds $X$ used for Pr@X start from 0.7.
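  A minimal sketch of how cIoU and Pr@X could be computed over a set of binary masks; the helper name and the convention of skipping no-target samples only for Pr@X follow my reading of the text:

```python
import numpy as np

def ciou_and_prX(preds, gts, thresholds=(0.7, 0.8, 0.9)):
    """preds/gts: lists of binary H x W numpy masks over the whole evaluation set."""
    total_inter, total_union, ious = 0, 0, []
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        total_inter += inter
        total_union += union
        if g.sum() > 0:                       # no-target samples are skipped for Pr@X
            ious.append(inter / union if union > 0 else 0.0)
    ciou = total_inter / max(total_union, 1)  # cumulative intersection / cumulative union
    pr_at = {x: float(np.mean([iou > x for iou in ious])) for x in thresholds}
    return ciou, pr_at
```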

N-acc. and T-acc.

  These two metrics evaluate the model on no-target samples: for a no-target sample, a prediction containing no foreground pixels is counted as a true positive (TP), otherwise as a false negative (FN). N-acc. then measures performance on no-target samples:
$$\text{N-acc.}=\frac{TP}{TP+FN}$$

  T-acc. reflects how much the no-target samples affect target samples, i.e., how many target samples are misclassified as no-target:
$$\text{T-acc.}=\frac{TN}{TN+FP}$$
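  Both accuracies reduce to simple counting over no-target flags; a sketch with assumed boolean arrays:

```python
import numpy as np

def no_target_accuracies(gt_no_target: np.ndarray, pred_empty: np.ndarray):
    """
    gt_no_target: True where the expression really refers to nothing
    pred_empty:   True where the model predicted an empty mask
    """
    tp = np.sum(gt_no_target & pred_empty)    # no-target correctly left empty
    fn = np.sum(gt_no_target & ~pred_empty)   # no-target but a mask was predicted
    tn = np.sum(~gt_no_target & ~pred_empty)  # target sample kept as target
    fp = np.sum(~gt_no_target & pred_empty)   # target sample wrongly emptied
    n_acc = tp / max(tp + fn, 1)
    t_acc = tn / max(tn + fp, 1)
    return n_acc, t_acc
```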

gIoU.

  Since cIoU favors larger objects and targets in GRES tend to have larger foregrounds, generalized IoU (gIoU) is introduced to treat all samples equally. gIoU averages the per-sample IoU over all samples. For no-target samples, the IoU of a correctly predicted (empty) no-target sample is counted as 1, while that of a failed no-target prediction is counted as 0.
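  And a sketch of gIoU with the no-target convention above (encoding a no-target sample as an all-zero GT mask is my assumption):

```python
import numpy as np

def giou(preds, gts):
    """preds/gts: lists of binary H x W masks; an all-zero GT marks a no-target sample."""
    ious = []
    for p, g in zip(preds, gts):
        if g.sum() == 0:                               # no-target sample
            ious.append(1.0 if p.sum() == 0 else 0.0)  # correct empty -> 1, else 0
        else:
            inter = np.logical_and(p, g).sum()
            union = np.logical_or(p, g).sum()
            ious.append(inter / union if union > 0 else 0.0)
    return float(np.mean(ious))
```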

6.2 Ablation studies

Necessity of the dataset

(figure omitted)

RIA design options

(figure omitted)

RLA design options

(figure omitted)

Number of regions $P$

(figure omitted)

6.3 Results on GRES

Comparison with RES SOTA methods

(figure omitted)

Quantitative results

(figure omitted)

Failure cases & discussions

(figure omitted)
  The main reason for the failures is that the expressions are highly deceptive. The model needs to capture fine-grained details of all objects and understand the image context in more detail.

6.4 Results on classic RES

(figure omitted)

7. Conclusion

  This paper analyzes and addresses the limitations of the classic RES task, namely its inability to handle multi-target and no-target expressions. Based on this, a new benchmark is proposed: Generalized Referring Expression Segmentation (GRES), which allows an expression to refer to any number of targets. Correspondingly, a large-scale dataset gRefCOCO is constructed, and a baseline method, ReLA, is proposed, which explicitly models the relationships between different image regions and words and achieves new SOTA on both the classic RES and GRES tasks. The proposed GRES opens up new application scenarios, such as image retrieval.

A few words at the end

  This is a paper with a lot of work behind it: it proposes a dataset, a method, and new metrics. It can be said to have opened up a large space for follow-up work. Interested readers are advised to keep an eye on it.

Origin blog.csdn.net/qq_38929105/article/details/132256322