PolyFormer: Referring Image Segmentation as Sequential Polygon Generation Paper Reading Notes

Preface

  I have mostly wrapped up the things I needed to take care of, so I will get back to reading papers and writing notes, and catch up on what I missed over the past two months (there are still 7 blog posts to go). Welcome to follow along, and let's keep being productive!

1. Abstract

  Unlike most existing referring image segmentation methods, this paper does not directly predict the target mask; instead it formulates the task as polygon sequence generation. It proposes a regression-based decoder that directly predicts precise geometric coordinates, abandoning the previous practice of quantizing coordinates onto a fixed grid. The method not only performs well on standard datasets but also generalizes well to referring video object segmentation.

2. Introduction

  As usual, the introduction first defines the referring image segmentation task and briefly reviews the contour-point paradigm of traditional instance segmentation, on which this work builds.
  As shown in Figure 1, this paper proposes a sequence-to-sequence model, the Polygon transFormer (PolyFormer), for this task. Each vertex is predicted conditioned on the previously predicted contour vertices, so the predicted contour points are no longer independent of one another. Both the input and output of PolyFormer are sequences (the output being a variable-length coordinate sequence), which avoids the elaborate multi-modal fusion designs of previous work. In the output sequence, multiple polygon contours are separated by a special separator token. The framework can also directly predict bounding boxes, since a box is just two corner coordinates, so the network unifies referring image segmentation and referring expression comprehension in one model (quite the selling point!).

(Figure 1: overview of PolyFormer's sequence-to-sequence polygon generation.)
  Recent seq2seq models quantize continuous coordinates onto a fixed grid, turning coordinate prediction into a classification task. This paper argues that such an approach is inherently lossy: coordinates live in a continuous space, so no matter how fine the grid, quantization introduces errors. PolyFormer instead treats localization as a regression task and predicts floating-point coordinates directly, so no quantization error is introduced. Inspired by Mask R-CNN, it uses bilinear interpolation over the neighboring index embeddings to obtain the embedding of a floating-point coordinate.

  Contributions are as follows:

  • This paper proposes PolyFormer, which treats Referring Image Segmentation (RIS) and Referring Expression Comprehension (REC) as sequence-to-sequence prediction problems.
  • A regression-based decoder is proposed to predict precise 2D coordinates directly, removing quantization errors. This is the first work to treat coordinate localization as a regression problem within a seq2seq framework.
  • The method performs well on several standard datasets and generalizes to video and synthetic data.

3. Related work

3.1 Referring Image Segmentation (RIS)

  This section covers the definition of RIS and the typical pipeline of previous work. The closest work to this paper is SeqTR, but it only generates a coarse mask from 18 vertices and cannot delineate complex shapes or masks split by occlusion.

3.2 Referring Expression Comprehension (REC)

  This section covers the definition of REC, earlier two-stage and one-stage methods, and recent unified multi-task frameworks (whose results were underwhelming). In contrast, the multi-task training of the method proposed here improves performance on both RIS and REC.

3.3 Sequence-to-Sequence (seq2seq) Modeling

  This section reviews the origin of the seq2seq framework and points out its shortcoming here: coordinate quantization introduces errors. The proposed PolyFormer treats coordinate localization as a regression problem and can predict continuous coordinates.

3.4 Contour-based instance segmentation

  This section reviews contour-based methods in instance segmentation and points out their shortcoming: they cannot handle a target that is split into multiple disconnected parts.

4. PolyFormer

4.1 Structure overview

(Figure: overall PolyFormer architecture.)

4.2 Target sequence construction

4.2.1 Polygon representation

  A mask can be the union of one or more polygons that outline the referred target. Each polygon is therefore converted into a sequence of 2D coordinates $\{(x_i, y_i)\}_{i=1}^{K}$, $(x_i, y_i) \in \mathbb{R}^2$, where the vertex closest to the top-left corner of the image is used as the starting point of the sequence, as shown below:

(Figure: polygon representation of a mask.)

4.2.2 Vertices and special tokens

  A <SEP> token is inserted between two polygon sequences to represent multiple polygons. Multiple polygons of the same target are ordered by the distance of their starting vertex from the image origin. Finally, <BOS> and <EOS> mark the beginning and end of the sequence.

4.2.3 Unified sequence with bounding boxes

  A bounding box can be represented by two corner points, e.g. the top-left corner $(x_1^b, y_1^b)$ and the bottom-right corner $(x_2^b, y_2^b)$. The bounding box and the polygons are then concatenated into one long sequence:

$$\left[\langle\text{BOS}\rangle, (x_1^b, y_1^b), (x_2^b, y_2^b), (x_1^1, y_1^1), (x_2^1, y_2^1), \ldots, \langle\text{SEP}\rangle, (x_1^n, y_1^n), \ldots, \langle\text{EOS}\rangle\right]$$

where $(x_1^n, y_1^n)$ is the starting vertex of the $n$-th polygon. The corner points of the bounding box and the vertices of the polygons are all treated as coordinate tokens, i.e. <COO>.
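To make this construction concrete, here is a minimal Python sketch of the target-sequence assembly described above; the marker strings, function name, and argument layout are my own illustration rather than the authors' code.

```python
import math

# Hypothetical markers for the special tokens; only their roles (<BOS>/<SEP>/<EOS>)
# come from the paper, the concrete representation here is an assumption.
BOS, SEP, EOS = "<BOS>", "<SEP>", "<EOS>"

def build_target_sequence(bbox, polygons):
    """Concatenate bounding-box corners and polygon vertices into one sequence.

    bbox     : (x1, y1, x2, y2) top-left and bottom-right corners.
    polygons : list of polygons, each a list of (x, y) vertices.
    """
    def roll_to_start(poly):
        # Start each polygon from the vertex closest to the image origin (top-left).
        start = min(range(len(poly)), key=lambda i: math.hypot(*poly[i]))
        return poly[start:] + poly[:start]

    # Order multiple polygons of the same target by the distance of their
    # starting vertex from the image origin.
    polys = sorted((roll_to_start(p) for p in polygons),
                   key=lambda p: math.hypot(*p[0]))

    seq = [BOS, (bbox[0], bbox[1]), (bbox[2], bbox[3])]
    for i, poly in enumerate(polys):
        if i > 0:
            seq.append(SEP)      # separator between polygons
        seq.extend(poly)         # every vertex is a coordinate token <COO>
    seq.append(EOS)
    return seq
```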

4.3 Image and text feature extraction

4.3.1 Image encoder

  A Swin Transformer extracts the 4th-stage feature $F_v \in \mathbb{R}^{\frac{H}{32}\times\frac{W}{32}\times C_{in}}$ from the input image $I \in \mathbb{R}^{H\times W\times 3}$ as the visual representation.

4.3.2 Text encoder

  BERT encodes the input linguistic description $T \in \mathbb{R}^{L}$ of $L$ words, producing word features $F_l \in \mathbb{R}^{L\times C_l}$.

4.3.3 Multi-modal Transformer encoder

  $F_v$ is flattened into a sequence of visual features $F_v' \in \mathbb{R}^{\left(\frac{H}{32}\cdot\frac{W}{32}\right)\times C_v}$, and fully connected layers project $F_v'$ and $F_l$ into the same embedding space:

$$F_v' = F_v' W_v + b_v, \quad F_l' = F_l W_l + b_l$$

where $W_v$, $W_l$, $b_v$, $b_l$ are learnable weights and biases. $F_v'$ and $F_l'$ are then concatenated as $F_M = [F_v', F_l']$.

  The multi-modal encoder consists of $N$ Transformer layers and outputs the multi-modal features $F_M^N$. In addition, 2D and 1D relative position encodings are added to the image and text features, respectively.
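Below is a minimal PyTorch sketch of this flatten-project-concatenate step, assuming the feature dimensions mentioned later in these notes ($C_{in}=1024$, $C_l = C_v = 768$); the module and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class MultiModalFusion(nn.Module):
    """Sketch of flattening F_v, projecting both modalities, and concatenating."""
    def __init__(self, c_in=1024, c_l=768, c_v=768):
        super().__init__()
        self.proj_v = nn.Linear(c_in, c_v)   # F_v' W_v + b_v
        self.proj_l = nn.Linear(c_l, c_v)    # F_l  W_l + b_l

    def forward(self, f_v, f_l):
        # f_v: (B, H/32, W/32, C_in) 4th-stage Swin feature
        # f_l: (B, L, C_l)           BERT word features
        b = f_v.shape[0]
        f_v = f_v.reshape(b, -1, f_v.shape[-1])              # flatten spatial dims
        f_m = torch.cat([self.proj_v(f_v), self.proj_l(f_l)], dim=1)
        return f_m                                           # input to the N-layer encoder
```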

4.4 Regression-based Transformer decoder

(Figure: the regression-based Transformer decoder.)

4.4.1 2D coordinate embedding

  A 2D coordinate codebook $\mathcal{D}\in \mathbb{R}^{B_H\times B_W\times C_e}$ is built, where $B_H$ and $B_W$ are the numbers of bins along the height and width directions. The embedding of any floating-point coordinate $(x, y) \in \mathbb{R}^2$ is obtained by bilinear interpolation: $(x, y)$ determines its four surrounding grid points $(\underline{x}, \underline{y})$, $(\bar{x}, \underline{y})$, $(\underline{x}, \bar{y})$, $(\bar{x}, \bar{y}) \in \mathbb{N}^2$, whose embeddings are looked up from $\mathcal{D}$, e.g. $e_{(\underline{x}, \underline{y})} = \mathcal{D}(\underline{x}, \underline{y})$, and the precise coordinate embedding $e_{(x, y)}$ is then computed as

$$e_{(x, y)} = (\bar{x}-x)(\bar{y}-y)\, e_{(\underline{x}, \underline{y})} + (x-\underline{x})(\bar{y}-y)\, e_{(\bar{x}, \underline{y})} + (\bar{x}-x)(y-\underline{y})\, e_{(\underline{x}, \bar{y})} + (x-\underline{x})(y-\underline{y})\, e_{(\bar{x}, \bar{y})}$$
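A small PyTorch sketch of this coordinate embedding, assuming coordinates normalized to $[0, 1]$ and the $64\times64$ codebook mentioned later; the class and parameter names are my own.

```python
import torch
import torch.nn as nn

class CoordinateEmbedding(nn.Module):
    """2D coordinate embedding via bilinear interpolation over a learnable codebook."""
    def __init__(self, b_h=64, b_w=64, c_e=768):
        super().__init__()
        self.codebook = nn.Parameter(torch.randn(b_h, b_w, c_e))
        self.b_h, self.b_w = b_h, b_w

    def forward(self, xy):
        # xy: (..., 2) float coordinates in [0, 1]
        x = xy[..., 0] * (self.b_w - 1)
        y = xy[..., 1] * (self.b_h - 1)
        x0, y0 = x.floor().long(), y.floor().long()
        x1 = (x0 + 1).clamp(max=self.b_w - 1)
        y1 = (y0 + 1).clamp(max=self.b_h - 1)
        wx, wy = x - x0.float(), y - y0.float()

        # Weighted sum of the four neighbouring codebook entries (bilinear weights).
        e = ((1 - wx) * (1 - wy)).unsqueeze(-1) * self.codebook[y0, x0] \
          + (wx       * (1 - wy)).unsqueeze(-1) * self.codebook[y0, x1] \
          + ((1 - wx) * wy      ).unsqueeze(-1) * self.codebook[y1, x0] \
          + (wx       * wy      ).unsqueeze(-1) * self.codebook[y1, x1]
        return e
```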

4.4.2 Transformer decoder layer

  The decoder likewise consists of $N$ Transformer decoder layers.

4.4.3 Prediction head

  The prediction head consists of a class head and a coordinate head; its input is the output $Q^N$ of the last decoder layer. The class head is a single linear layer that predicts the token type (<COO>, <SEP>, <EOS>): $\hat{p} = W_c Q^N + b_c$, where $W_c$ and $b_c$ are the parameters of the linear layer.
  The coordinate head is a 3-layer FFN with ReLU activations between layers; it predicts the 2D coordinates of the bounding-box corners and polygon vertices of the referred target: $(\hat{x}, \hat{y}) = \operatorname{Sigmoid}\left(\mathrm{FFN}\left(Q^{N}\right)\right)$.
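A minimal PyTorch sketch of this prediction head (single linear class head plus a 3-layer ReLU FFN with a sigmoid output); the module name and hidden sizes are assumptions.

```python
import torch
import torch.nn as nn

class PredictionHead(nn.Module):
    """Class head + coordinate head on top of the decoder output Q^N."""
    def __init__(self, d_model=768, num_types=3):   # <COO>, <SEP>, <EOS>
        super().__init__()
        self.class_head = nn.Linear(d_model, num_types)   # p_hat = W_c Q^N + b_c
        self.coord_head = nn.Sequential(                  # 3-layer FFN with ReLU
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, d_model), nn.ReLU(),
            nn.Linear(d_model, 2),
        )

    def forward(self, q):
        # q: (..., d_model) output of the last decoder layer
        token_type = self.class_head(q)                   # token-type logits
        xy = torch.sigmoid(self.coord_head(q))            # (x_hat, y_hat) in [0, 1]
        return token_type, xy
```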

4.5 Training

4.5.1 Polygon augmentation

  A polygon is a sparse representation of a target contour derived from a dense set of contour points. Given such a dense set, the resulting sparse polygon is not unique, as shown below:

(Figure: polygon augmentation by dense interpolation and uniform downsampling.)
  Dense contour points are first interpolated from the original polygon outline, and sparse polygons are then produced by uniform downsampling, with the first sample point chosen within a fixed range. This yields polygon representations at different granularities, from coarse to fine, and training with them prevents the model from overfitting to a single fixed polygon representation.
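A rough sketch of this augmentation, assuming NumPy; the numbers of dense/sparse points and the offset range are illustrative placeholders, not values from the paper.

```python
import numpy as np

def augment_polygon(vertices, num_dense=400, num_sparse=36, max_offset=8, rng=None):
    """Densely interpolate the contour, then uniformly downsample it
    starting from a randomly shifted sample point."""
    rng = rng or np.random.default_rng()
    pts = np.asarray(vertices, dtype=np.float64)

    # 1) interpolate dense contour points along the closed polygon outline
    closed = np.vstack([pts, pts[:1]])
    seg_len = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg_len)])
    samples = np.linspace(0.0, cum[-1], num_dense, endpoint=False)
    dense = np.stack([np.interp(samples, cum, closed[:, 0]),
                      np.interp(samples, cum, closed[:, 1])], axis=1)

    # 2) uniformly downsample, shifting the first sample within a fixed range
    offset = rng.integers(0, max_offset)
    idx = (offset + np.arange(num_sparse) * (num_dense // num_sparse)) % num_dense
    return dense[idx]
```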

4.5.2 Objectives

  The model is trained to predict the next token and its type:

$$L_t = \lambda_t L_{coo}\big((x_t, y_t), (\hat{x}_t, \hat{y}_t) \mid I, T, (x, y)_{1:t-1}\big)\cdot \mathbb{I}\left[p_t = \langle\text{COO}\rangle\right] + \lambda_{cls} L_{cls}\big(p_t, \hat{p}_t \mid I, T, \hat{p}_{1:t-1}\big)$$

where
$$\lambda_t = \begin{cases}\lambda_{box}, & t \le 2,\\ \lambda_{poly}, & \text{otherwise},\end{cases}$$
$L_{coo}$ is the $L_1$ regression loss, $L_{cls}$ is the label-smoothed cross-entropy loss, and $\mathbb{I}[\cdot]$ is the indicator function. The $L_1$ regression loss is only computed for coordinate tokens, and $\lambda_{box}$, $\lambda_{poly}$ are the weights for the corresponding tokens. The total training loss is the sum of $L_t$ over all tokens in a sequence.
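A sketch of how this per-sequence loss could be computed, assuming PyTorch; the tensor layouts and mask names are my own, and only the weighting scheme follows the formula above.

```python
import torch
import torch.nn.functional as F

def polyformer_loss(type_logits, xy_pred, type_target, xy_target, is_coord, is_box,
                    lam_box=0.1, lam_poly=1.0, lam_cls=0.0005, smoothing=0.1):
    """type_logits: (T, num_types) token-type logits, xy_pred/xy_target: (T, 2),
    is_coord: (T,) bool mask for <COO> tokens, is_box: (T,) bool mask for t <= 2."""
    # L1 regression loss, computed only on coordinate tokens and weighted by
    # lambda_box for the two box corners and lambda_poly otherwise.
    l1 = F.l1_loss(xy_pred, xy_target, reduction="none").sum(-1)
    lam_t = torch.where(is_box,
                        torch.full_like(l1, lam_box),
                        torch.full_like(l1, lam_poly))
    coord_loss = (lam_t * l1 * is_coord.float()).sum()

    # Label-smoothed cross-entropy on the token type of every position.
    cls_loss = F.cross_entropy(type_logits, type_target,
                               label_smoothing=smoothing, reduction="sum")
    return coord_loss + lam_cls * cls_loss
```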

4.5.3 Inference

  During inference, generation starts from the <BOS> token. At each step the class head first predicts the token type: if it is a coordinate token, the coordinate head predicts the next 2D coordinate conditioned on the previously generated tokens; if it is a separator token, it marks the end of one polygon and is appended to the output sequence.

  Generation stops once <EOS> is produced. In the generated sequence, the first two coordinate tokens are the bounding-box corners and the remaining ones are polygon vertices; the final segmentation mask is rendered from these polygons.
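A sketch of the greedy autoregressive decoding loop described here; the `decoder`, `head`, and embedding interfaces are illustrative assumptions, not the authors' actual API.

```python
import torch

COO, SEP, EOS = 0, 1, 2   # token-type indices, assumed to match the head above

@torch.no_grad()
def generate(decoder, head, coord_embed, memory, bos_embed, sep_embed, max_len=200):
    """Greedy decoding: start from <BOS>, stop at <EOS>."""
    tokens, result = [bos_embed], []
    for _ in range(max_len):
        q = decoder(torch.stack(tokens), memory)[-1]   # Q^N for the latest step
        type_logits, xy = head(q)
        t = type_logits.argmax(-1).item()
        if t == EOS:                                   # <EOS>: stop generation
            break
        if t == COO:                                   # coordinate token
            result.append((xy[0].item(), xy[1].item()))
            tokens.append(coord_embed(xy))             # feed the new coordinate back in
        else:                                          # <SEP>: one polygon finished
            result.append("<SEP>")
            tokens.append(sep_embed)
    return result       # first two coordinates are the bounding-box corners
```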

5. Experimental results

5.1 Datasets and Metrics

5.1.1 Dataset

  RIS and REC: RefCOCO, RefCOCO+, and RefCOCOg.

5.1.2 Evaluation metrics

  mean IoU, overall IoU, and [email protected].

5.2 Implementation details

5.2.1 Model settings

  PolyFormer-B:Swin-B + BERT-base
  PolyFormer-L:Swin-L + BERT-base

5.2.2 Training details

  Pre-training is performed on the REC task using data composed of Visual Genome, RefCOCO, RefCOCO+, RefCOCOg, and Flickr30k-entities. In the multi-task fine-tuning stage, the model is fine-tuned on the RefCOCO, RefCOCO+, and RefCOCOg training sets. AdamW optimizer with lr = 0.00005; 20 epochs of pre-training with batch size 160, followed by 100 epochs of fine-tuning with batch size 128. $\lambda_{box}=0.1$, $\lambda_{poly}=1$, $\lambda_{cls}=0.0005$. Input image size $512\times512$, 50% polygon augmentation probability, and a $64\times64$ grid for the 2D coordinate embedding codebook.

5.3 Main results

5.3.1 Referring image segmentation

(Table: referring image segmentation results.)

5.3.2 Referring expression comprehension

(Table: referring expression comprehension results.)

5.3.3 Zero-shot transfer to referring video object segmentation

(Table: zero-shot transfer results on referring video object segmentation.)

5.4 Ablation experiment

  Ablation experiments were performed on the RefCOCO, RefCOCO+, and RefCOCOg validation sets, using the model PolyFormer-B.

5.4.1 Coordinate classification vs. regression

(Table: ablation of coordinate classification vs. regression.)

5.4.2 Component analysis of the target sequence

(Table: ablation of target sequence components.)

5.5 Visualizing results

5.5.1 Cross attention map

(Figure: cross-attention maps.)

5.5.2 Prediction visualization

(Figure: prediction visualizations.)

6. Conclusion

  This paper proposes PolyFormer, a simple and unified sequence framework for RIS and REC. In addition, a novel decoder is designed to generate continuous 2D coordinates without quantization error, and the experimental results are very good.

Limitations and wider implications

  Training PolyFormer requires a large amount of accurately annotated bounding-box and polygon-vertex data; achieving region-level understanding from weakly supervised data requires further research.

Appendix

A. Additional dataset details

  An introduction to the Ref-DAVIS17 dataset.

B. Additional implementation details

  In PolyFormer-B the visual feature dimension $C_{in}$ is 1024, and in PolyFormer-L it is 1536. The linguistic feature dimension $C_l$ and the coordinate embedding dimension $C_e$ are both 768; 12 attention heads are used; GELU activation is used in every Transformer layer; for $L_{cls}$, the label smoothing factor is set to 0.1.

C. Additional experimental results

(Table: additional experimental results.)

D. More visual results

D.1 Cross attention map

(Figure: additional cross-attention maps.)

D.2 Visualization of prediction results

(Figure: additional prediction visualizations.)

Closing remarks

  The novelty of this paper is solid, and it is good at breaking with convention (perhaps because the resources were there, haha). The polygon augmentation trick and the token concatenation design are worth learning from.

  PS: Continue tomorrow! Welcome to follow~
