REC Series: Visual Grounding with Transformers Paper Reading Notes


Opening remarks

  Hello, another week has passed and school is about to start. I hope everyone has settled back into the rhythm. Keep going~
  This is another article about REC. The paper is a bit older, but it is still a must-read for beginners.

1. Abstract

  This paper proposes a Transformer-based method for visual grounding. Existing methods either first extract proposals and then rank them, which makes them rely heavily on pre-trained object detectors, or follow a proposal-free framework that upgrades an off-the-shelf one-stage detector by fusing textual embeddings. In contrast, the method proposed here, Visual Grounding with TRansformers (VGTR), is built on the Transformer framework, is independent of pre-trained detectors and word-embedding models, and learns semantically discriminative visual features. Experiments show state-of-the-art performance.

2. Introduction

  The introduction first lays out the definition, applications, and difficulties of visual grounding. Early approaches viewed visual grounding as a special case of text-based image retrieval and framed it as retrieving the referent from a set of candidate regions in a given image. These methods rely heavily on pre-trained object detectors, often ignore the visual context of the object, and usually require additional computational cost to generate and process candidate proposals.
  Some recent works remove the proposal generation step and localize the target directly, but their visual and textual features remain largely independent of each other. To alleviate this problem, this paper proposes VGTR, an end-to-end Transformer-based network that can capture global visual and linguistic context without generating target proposals. Compared with methods based on off-the-shelf detectors or grid features, VGTR treats visual grounding as the problem of directly regressing the coordinates of the bounding box of the target referred to by the query sentence.

[Figure: overall architecture of VGTR]
As shown in the figure above, VGTR consists of four modules: a basic encoder that computes the initial tokens of the image-text pair; a two-stream grounding encoder that performs joint visual-linguistic reasoning and cross-modal interaction; a grounding decoder that treats the text tokens as grounding queries and extracts target-related features from the encoded visual tokens; and a prediction head that regresses the bounding-box coordinates. In addition, a new self-attention mechanism is designed to replace the original one applied to the visual tokens; it establishes the association between the two modalities and learns text-guided visual features without weakening the localization ability.
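To make the overall data flow concrete, here is a minimal PyTorch sketch of the four-module pipeline as I read it from the figure; the module internals are placeholders and all names (`VGTRSketch`, `grounding_encoder`, etc.) are my own assumptions, not the authors' code.

```python
import torch
from torch import nn

class VGTRSketch(nn.Module):
    """Minimal sketch of the VGTR data flow: basic encoders -> grounding
    encoder -> grounding decoder -> prediction head. The stand-in modules
    below are placeholders, not the authors' implementation."""
    def __init__(self, d=256, num_text_tokens=4):
        super().__init__()
        self.visual_backbone = nn.Conv2d(3, d, kernel_size=32, stride=32)  # stand-in for the ResNet backbone (stride s = 32)
        self.text_parser = nn.GRU(d, d, batch_first=True)                  # stand-in for the RNN soft-parser
        self.grounding_encoder = nn.Identity()                             # two-stream encoder, omitted here
        self.grounding_decoder = nn.Identity()                             # decoder, omitted here (would also attend to x_v)
        self.head = nn.Sequential(nn.Linear(num_text_tokens * d, d), nn.ReLU(), nn.Linear(d, 4))
        self.num_text_tokens = num_text_tokens

    def forward(self, image, word_vectors):
        # 1) basic encoders: image -> visual tokens, expression -> text tokens
        fmap = self.visual_backbone(image)                  # (B, d, h/s, w/s)
        x_v = fmap.flatten(2).transpose(1, 2)               # (B, T_v, d)
        x_l, _ = self.text_parser(word_vectors)             # (B, T, d)
        x_l = x_l[:, :self.num_text_tokens, :]              # (B, T_l, d), crude stand-in for the soft-parser pooling
        # 2) grounding encoder: joint reasoning over both token streams
        x_v, x_l = self.grounding_encoder(x_v), self.grounding_encoder(x_l)
        # 3) grounding decoder: text tokens act as grounding queries over x_v
        z = self.grounding_decoder(x_l)                     # (B, T_l, d)
        # 4) prediction head: concatenate the outputs and regress (cx, cy, w, h)
        return self.head(z.flatten(1))

box = VGTRSketch()(torch.randn(1, 3, 512, 512), torch.randn(1, 20, 256))
print(box.shape)   # torch.Size([1, 4])
```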

  The contributions are summarized below:

  • An end-to-end framework, VGTR, is proposed for visual grounding without the need for pre-trained detectors or pre-trained language models;
  • A text-guided attention module is proposed to learn visual features under the guidance of the language description;
  • The method achieves state-of-the-art performance.

3. Related work

3.1 Visual grounding

  This article divides visual grounding methods into two categories: propose-and-rank and proposal-free. The former first generates a set of candidate proposals from the input image with an off-the-shelf detector or proposal generator, then scores these candidates against the language description and selects the highest-scoring one as the grounding result. These methods rely heavily on the performance of the pre-trained detector or proposal generator.
  Proposal-free methods instead localize the referred target directly and have great potential in both accuracy and inference speed.

3.2 Visual Transformer

  Transformers have recently become very popular in object detection and image segmentation. The DETR family transforms visual feature maps into a set of tokens and achieves state-of-the-art results.

4. Method

4.1 Basic visual and text encoders

  Given an image and referring expression pair $(I, E)$, the visual grounding task aims to locate the target instance described by the expression with a bounding box.
  The image is first resized to $w \times h$ and fed into a ResNet backbone to extract the feature map $F \in \mathbb{R}^{\frac{w}{s} \times \frac{h}{s} \times d}$, where $s$ is the output stride of the backbone and $d$ is the number of channels. The feature map $F$ is then flattened into visual tokens $X_v = \{v_i\}_{i=1}^{T_v}$, where $T_v = \frac{w}{s} \times \frac{h}{s}$ is the number of tokens and each $v_i$ has dimension $d$.
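For concreteness, a tiny sketch of the tokenization step (the `(B, d, h/s, w/s)` tensor layout is my assumption; only the shapes follow the paper's notation):

```python
import torch

w, h, s, d = 512, 512, 32, 256          # image size, backbone stride, channel dimension
F = torch.randn(1, d, h // s, w // s)   # backbone feature map, shape (B, d, h/s, w/s)

# Flatten the spatial grid into T_v = (w/s) * (h/s) visual tokens of dimension d.
X_v = F.flatten(2).transpose(1, 2)      # (B, T_v, d)
print(X_v.shape)                        # torch.Size([1, 256, 256]) since T_v = 16 * 16 = 256
```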
  An RNN-based soft-parser is used to extract the text tokens. For a given expression $E = \{e_t\}_{t=1}^{T}$, where $T$ is the number of words, a learnable embedding layer $u_t = \text{Embedding}(e_t)$ first converts each word $e_t$ into a vector $u_t$. A bidirectional LSTM is then applied to encode the context of each word, and the attention of the $k$-th text token on the $t$-th word is computed as:

$$
\begin{aligned}
h_t &= \operatorname{Bi-LSTM}(u_t, h_{t-1}) \\
a_{k,t} &= \frac{\exp(f_k^T h_t)}{\sum_{i=1}^{T} \exp(f_k^T h_i)}
\end{aligned}
$$

The $k$-th text token is then defined as the attention-weighted sum of the word embeddings:

$$
l_k = \sum_{t=1}^{T} a_{k,t} u_t
$$

The final text tokens are denoted $X_l = \{l_k\}_{k=1}^{T_l}$, where $T_l$ is the number of tokens and each $l_k$ has dimension $d$.
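A minimal sketch of this soft-parser, assuming the word-embedding dimension equals $d$ and one learnable vector $f_k$ per text token; class and variable names are mine, not from any released code:

```python
import torch
from torch import nn

class SoftParser(nn.Module):
    """Turns a word sequence into T_l text tokens via Bi-LSTM attention."""
    def __init__(self, vocab_size, d=256, num_tokens=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d)
        # Bi-LSTM with hidden size d//2 per direction so each h_t has dimension d.
        self.lstm = nn.LSTM(d, d // 2, batch_first=True, bidirectional=True)
        self.f = nn.Parameter(torch.randn(num_tokens, d))   # f_k, one per text token

    def forward(self, words):                                # words: (B, T) integer ids
        u = self.embed(words)                                # (B, T, d)
        h, _ = self.lstm(u)                                  # (B, T, d)
        scores = torch.einsum("kd,btd->bkt", self.f, h)      # f_k^T h_t
        a = scores.softmax(dim=-1)                           # attention over words, (B, T_l, T)
        return torch.bmm(a, u)                               # l_k = sum_t a_{k,t} u_t -> (B, T_l, d)

parser = SoftParser(vocab_size=1000)
tokens = parser(torch.randint(0, 1000, (2, 20)))
print(tokens.shape)                                          # torch.Size([2, 4, 256])
```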

4.2 Grounding encoder

  The grounding encoder is composed of $N$ stacked identical layers, each of which contains two independent branches (visual and linguistic) for processing the visual and text tokens. Like a standard Transformer layer, each branch contains three sub-layers: a normalization layer, a multi-head self-attention layer, and a fully connected feed-forward network (FFN).

Self-attention text branch

  Queries $q_l$, keys $k_l$, and values $v_l$ are obtained from the text tokens $X_l^i$ of the $i$-th layer, and the output of the text self-attention layer is:

$$
\operatorname{T-Attn}(q_l, k_l, v_l) = \operatorname{softmax}\left(\frac{q_l k_l^T}{\sqrt{d}}\right) \cdot v_l
$$

An FFN, denoted $\text{FFN}_l$, is then applied to obtain the text features $X_l^{i+1}$:

$$
X_l^{i+1} = \text{FFN}_l(\operatorname{T-Attn}(q_l, k_l, v_l))
$$
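A single-head sketch of one text-branch layer following these equations (the paper uses multi-head attention; residual connections and the exact normalization placement are simplified here as assumptions):

```python
import torch
from torch import nn

class TextBranchLayer(nn.Module):
    """One text-branch layer: Norm -> self-attention -> FFN (single-head sketch)."""
    def __init__(self, d=256):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.proj_q = nn.Linear(d, d)
        self.proj_k = nn.Linear(d, d)
        self.proj_v = nn.Linear(d, d)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.scale = d ** 0.5

    def forward(self, x_l):                                          # x_l: (B, T_l, d)
        x = self.norm(x_l)
        q, k, v = self.proj_q(x), self.proj_k(x), self.proj_v(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.scale, dim=-1)
        return self.ffn(attn @ v)                                    # X_l^{i+1}

layer = TextBranchLayer()
print(layer(torch.randn(2, 4, 256)).shape)   # torch.Size([2, 4, 256])
```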

Visual branch with text-guided self-attention

  The visual branch has a structure similar to the text branch, but with an additional component called text-guided self-attention, which extracts salient visual features under the guidance of the text description. Specifically, queries $q_v$, keys $k_v$, and values $v_v$ are obtained from the visual tokens $X_v^i$ of the $i$-th layer. The text features $X_l^{i+1}$ then supplement the visual queries as additional guidance: for ease of implementation, a weighted sum of the text tokens $X_l^{i+1}$ is added to the visual queries $q_v$, where the weights come from the dot product of $q_v$ and $X_l^{i+1}$:

$$
\begin{aligned}
\hat{q}_v &= q_v + \operatorname{softmax}\left(\frac{q_v (X_l^{i+1})^T}{\sqrt{d}}\right) \cdot X_l^{i+1} \\
\operatorname{V-Attn}(\hat{q}_v, k_v, v_v) &= \operatorname{softmax}\left(\frac{\hat{q}_v k_v^T}{\sqrt{d}}\right) \cdot v_v
\end{aligned}
$$

Similarly, an FFN, denoted $\text{FFN}_v$, is applied to obtain the visual tokens $X_v^{i+1}$:

$$
X_v^{i+1} = \text{FFN}_v(\operatorname{V-Attn}(\hat{q}_v, k_v, v_v))
$$
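The distinguishing piece is the query update; a single-head sketch (projection layers and normalization omitted, so this is an illustration under those assumptions rather than the official implementation):

```python
import torch

def text_guided_attention(q_v, k_v, v_v, x_l, d=256):
    """V-Attn with visual queries enhanced by the text tokens X_l^{i+1}.

    q_v, k_v, v_v: (B, T_v, d) visual queries/keys/values
    x_l:           (B, T_l, d) text tokens from the same layer
    """
    # q_hat = q_v + softmax(q_v X_l^T / sqrt(d)) X_l : add a text-weighted summary to the queries.
    guide = torch.softmax(q_v @ x_l.transpose(-2, -1) / d ** 0.5, dim=-1) @ x_l
    q_hat = q_v + guide
    # Standard scaled dot-product attention over the visual tokens with the guided queries.
    attn = torch.softmax(q_hat @ k_v.transpose(-2, -1) / d ** 0.5, dim=-1)
    return attn @ v_v

B, T_v, T_l, d = 1, 256, 4, 256
out = text_guided_attention(torch.randn(B, T_v, d), torch.randn(B, T_v, d),
                            torch.randn(B, T_v, d), torch.randn(B, T_l, d))
print(out.shape)   # torch.Size([1, 256, 256])
```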
The following figure compares a typical cross-modal attention mechanism with the text-guided attention mechanism proposed in this paper:

[Figure: standard cross-modal attention vs. the proposed text-guided attention]
  In a typical multi-modal attention mechanism, the queries come from one modality while the keys and values come from the other, in the style of the cross-attention in a Transformer decoder. However, injecting text information into the image features in this way may hurt localization ability, so this paper instead uses the text tokens to guide the visual features, which achieves higher performance.

4.3 Grounding decoder

  The decoder is likewise composed of $N$ stacked identical layers. Each layer has four sub-layers: a normalization layer, a grounding-query self-attention layer, an encoder-decoder attention layer, and a fully connected feed-forward (FFN) layer.
  The input to the grounding decoder is the refined text tokens $X_l^N$, which serve as the grounding queries $G$, together with the encoded visual tokens $X_v^N$. Under the guidance of the grounding queries, the grounding-query self-attention and encoder-decoder attention mechanisms decode the text-guided visual features.

Grounding-query self-attention

  Queries $q_g$, keys $k_g$, and values $v_g$ are obtained from the grounding queries $G^i$ of the $i$-th layer. Standard self-attention is then applied to enhance the queries:

$$
\operatorname{G-Attn}(q_g, k_g, v_g) = \operatorname{softmax}\left(\frac{q_g k_g^T}{\sqrt{d}}\right) \cdot v_g
$$

The refined grounding queries are then obtained through layer normalization (LN):

$$
G^{i+1} = \operatorname{LN}(\operatorname{G-Attn}(q_g, k_g, v_g))
$$
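In code this step is plain self-attention over the $T_l$ grounding queries followed by layer normalization; a single-head sketch with identity projections (an assumption on my part):

```python
import torch
from torch import nn

def grounding_query_self_attention(g, norm, d=256):
    """G^{i+1} = LN(G-Attn(q_g, k_g, v_g)) with q_g = k_g = v_g = G^i."""
    attn = torch.softmax(g @ g.transpose(-2, -1) / d ** 0.5, dim=-1)
    return norm(attn @ g)

g = torch.randn(1, 4, 256)                 # T_l = 4 grounding queries
print(grounding_query_self_attention(g, nn.LayerNorm(256)).shape)   # torch.Size([1, 4, 256])
```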

Encoder-decoder attention

  The encoder-decoder attention treats the grounding queries $G^{i+1}$ as queries $f_g^q$ and obtains keys $f_v^k$ and values $f_v^v$ from the encoded visual tokens $X_v^N$, outputting the extracted text-relevant features:

$$
\operatorname{ED-Attn}(q_g, k_v, v_v) = \operatorname{softmax}\left(\frac{q_g k_v^T}{\sqrt{d}}\right) \cdot v_v
$$

Finally, an FFN, denoted $\text{FFN}_{ed}$, is applied to obtain the final embeddings $Z$:

$$
Z = \text{FFN}_{ed}(\operatorname{ED-Attn}(q_g, k_v, v_v))
$$
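A sketch of this cross-attention step, with queries taken from the grounding queries and keys/values from the encoded visual tokens (single-head, projection layers omitted as an assumption):

```python
import torch
from torch import nn

def encoder_decoder_attention(g, x_v, ffn, d=256):
    """Z = FFN_ed(ED-Attn(q_g, k_v, v_v)) with q_g = G^{i+1}, k_v = v_v = X_v^N."""
    attn = torch.softmax(g @ x_v.transpose(-2, -1) / d ** 0.5, dim=-1)   # (B, T_l, T_v)
    return ffn(attn @ x_v)                                               # (B, T_l, d)

g, x_v = torch.randn(1, 4, 256), torch.randn(1, 256, 256)
ffn = nn.Sequential(nn.Linear(256, 1024), nn.ReLU(), nn.Linear(1024, 256))
print(encoder_decoder_attention(g, x_v, ffn).shape)                      # torch.Size([1, 4, 256])
```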

4.4 Prediction head and training objective

  This paper treats referring target localization as a bounding-box coordinate regression problem. The transformed embeddings $Z = \{z_i\}_{i=1}^{K} \in \mathbb{R}^{K \times d}$ from the grounding decoder are concatenated, and a prediction head regresses the center coordinates, width, and height of the target box. The prediction head consists of two fully connected layers followed by a ReLU activation.
  The training objective combines the L1 loss and the generalized IoU (GIoU) loss $\mathcal{L}_{iou}(\cdot)$:

$$
Loss = \lambda_{L_1}\,||b - \hat{b}||_1 + \lambda_{L_{iou}}\,\mathcal{L}_{iou}(b, \hat{b})
$$

where $\hat{b}$ is the predicted bounding box, $b$ is the ground truth, and $\lambda_{L_1}, \lambda_{L_{iou}} \in \mathbb{R}$ are hyperparameters that balance the two losses.
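A sketch of the prediction head and the combined loss, assuming boxes in normalized (cx, cy, w, h) format; torchvision's `generalized_box_iou_loss` and `box_convert` are used as a stand-in for the GIoU term, which may differ from the authors' implementation:

```python
import torch
from torch import nn
from torchvision.ops import box_convert, generalized_box_iou_loss

class PredictionHead(nn.Module):
    """Two FC layers with ReLU regressing (cx, cy, w, h) from the concatenated decoder output."""
    def __init__(self, num_tokens=4, d=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(num_tokens * d, d), nn.ReLU(), nn.Linear(d, 4))

    def forward(self, z):                        # z: (B, K, d)
        return self.mlp(z.flatten(1)).sigmoid()  # normalized (cx, cy, w, h)

def grounding_loss(pred, gt, lambda_l1=5.0, lambda_iou=2.0):
    """Loss = lambda_L1 * ||b - b_hat||_1 + lambda_iou * L_iou(b, b_hat)."""
    l1 = (pred - gt).abs().sum(dim=-1).mean()
    giou = generalized_box_iou_loss(box_convert(pred, "cxcywh", "xyxy"),
                                    box_convert(gt, "cxcywh", "xyxy"),
                                    reduction="mean")
    return lambda_l1 * l1 + lambda_iou * giou

head = PredictionHead()
pred = head(torch.randn(2, 4, 256))
print(grounding_loss(pred, torch.rand(2, 4)))
```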

5. Experiment

5.1 Dataset

  Flickr30K Entities and RefCOCO/RefCOCO+/RefCOCOg.

5.2 Implementation details

Hyperparameter settings

  The input image is resized to $512 \times 512$, the maximum sentence length is 20, and the output stride of the backbone is $s = 32$. For all datasets, 4 text tokens are extracted. Multi-head attention uses 8 heads, the hidden size is $d = 256$, the number of VGTR layers is $N = 2$, and the loss weights are $\lambda_{L_1} = 5$ and $\lambda_{L_{iou}} = 2$.
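Collected into a plain config dict for reference (the key names are mine; the values are the ones reported above):

```python
# Hypothetical config dict mirroring the reported hyperparameters.
VGTR_CONFIG = {
    "image_size": (512, 512),
    "max_sentence_length": 20,
    "backbone_stride": 32,      # s
    "num_text_tokens": 4,       # T_l
    "num_heads": 8,
    "hidden_dim": 256,          # d
    "num_layers": 2,            # N, for both encoder and decoder
    "lambda_l1": 5.0,
    "lambda_iou": 2.0,
}
```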

Training and evaluation details

  The model is trained with the AdamW optimizer, an initial learning rate of $1\mathrm{e}{-4}$, and a weight decay of $1\mathrm{e}{-5}$. The CNN backbone is ResNet-50/101, initialized with weights pre-trained on MSCOCO. Training runs for 120 epochs in total, with the learning rate decayed by 10% at the 70th and 90th epochs. The accuracy at an IoU threshold of 0.5 ([email protected]) is used as the evaluation metric.
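A sketch of the corresponding optimizer and schedule in PyTorch (a single parameter group and the exact decay factor are my assumptions):

```python
import torch
from torch import nn

model = nn.Linear(256, 4)   # placeholder for the full VGTR model

# AdamW with the reported initial learning rate and weight decay.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=1e-5)

# Decay the learning rate at epochs 70 and 90 of 120. gamma=0.1 assumes a decay to 10% of the
# current value, the common convention; if "decayed by 10%" is meant literally, use gamma=0.9.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[70, 90], gamma=0.1)

for epoch in range(120):
    # ... one training epoch over the dataset would go here ...
    scheduler.step()
```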

5.3 Comparison with SOTA methods

  
[Table: comparison with state-of-the-art methods]

5.4 Ablation experiment

Contribution of each part

  
[Table: ablation on the contribution of each component]

Effectiveness of text-guided self-attention

  
[Table: ablation on the effectiveness of text-guided self-attention]

Number of layers

  The results are in the same Table 3 above.

5.5 Qualitative analysis

  
[Figure: qualitative grounding examples]

6. Conclusion

  This paper proposes VGTR, a single-stage Transformer-based framework for visual grounding, and experiments show that the method is effective.

Closing remarks

  This paper is relatively short, but it can be regarded as a good application and improvement of the Transformer back in 2021. Looking at it now, the results are no longer that striking. The writing could also be better: many concepts are only mentioned in passing without being explored in depth, and part of the paper is spent on the Transformer structure itself, which is not a great use of space.

Source: blog.csdn.net/qq_38929105/article/details/132360484