Paper Interpretation | [AAAI 2023] DPText-DETR: Towards Better Scene Text Detection with Dynamic Points in Transformer

Motivation

Contour-based detection methods (which predict polygon points or Bezier curve control points) may suffer sub-optimal training efficiency and performance because of coarse positional query modeling. In addition, the point-label form used in prior work implicitly encodes a human reading order, which the authors observe hampers detection robustness. To address these two problems, the authors propose a concise dynamic-point scene text detection Transformer network: DPText-DETR.

Method Overview

1. The image is fed into a CNN backbone plus a Transformer encoder for feature extraction, and K axis-aligned boxes are generated in the final encoder layer (K equals the number of text instances in the input image). Using the center point and scale of each box, a fixed number of initial control-point coordinates are uniformly sampled along its top and bottom edges, and these serve as suitable reference points for the deformable cross-attention module.
2. In the decoder, the sampled control-point coordinates are encoded and added to the corresponding control-point content queries to form composite queries. The composite queries are first sent to EFSA to further mine their relative relationships, then fed to the deformable cross-attention module; a control-point coordinate prediction head dynamically updates the reference points layer by layer. Finally, prediction heads generate a class confidence score and N control-point coordinates for each text instance.

Explicit Point Query Modeling (EPQM)

EPQM consists of two parts: prior point sampling and point update.

a. Prior point sampling.

The TESTR model proposes a scheme that converts axis-aligned (rectangular) boxes into polygon boxes:

In the last encoder layer, the top-K proposal generator produces K axis-aligned boxes (K equals the number of text instances in the image). Each axis-aligned box is shared by the N control-point content queries.

Therefore, the compound query in TESTR can be expressed as Q^{i} (i = 1, ..., K):

Q^{i} = P^{i} + C = \varphi ((x, y, w, h)^{(i)}) + (p_{1}, ..., p_{N})

Here P denotes the positional part of the compound query and C the content part; \varphi is a sinusoidal positional encoding function followed by linear and normalization layers; (x, y, w, h) are the center coordinates and size of each axis-aligned box; and (p_{1}, ..., p_{N}) are N learnable control-point content queries shared across the K compound queries. In TESTR's formulation, different control-point content queries share the same axis-aligned box prior, which mismatches the point-level targets to some extent: the content queries lack their own explicit positions and cannot be anchored inside the box region.
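To make the shared-prior issue concrete, here is a minimal numpy sketch of a sinusoidal positional encoding like \varphi (function name and feature size are illustrative; the real implementation is in PyTorch and adds linear and normalization layers on top):

```python
import numpy as np

def sine_embed(coords, num_feats=128, temperature=10000.0):
    """Sinusoidal embedding of normalized scalar coordinates.

    coords: (..., D) array of values in [0, 1]; each scalar expands
    into num_feats sin/cos features, giving (..., D * num_feats).
    """
    dim_t = temperature ** (2 * (np.arange(num_feats) // 2) / num_feats)
    pos = coords[..., None] * 2 * np.pi / dim_t      # (..., D, num_feats)
    emb = np.empty_like(pos)
    emb[..., 0::2] = np.sin(pos[..., 0::2])
    emb[..., 1::2] = np.cos(pos[..., 1::2])
    return emb.reshape(*coords.shape[:-1], -1)

# TESTR-style: one embedding per (x, y, w, h) box, shared by all
# N control-point content queries of that instance.
box = np.array([[0.5, 0.5, 0.2, 0.1]])              # (K=1, 4)
print(sine_embed(box).shape)                        # (1, 512)
```

Because every point query of an instance receives this same 512-d box embedding, no point query knows *where inside the box* it should attend.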

In response to the above problems, the authors of this paper make the following improvement.

In DPText-DETR's compound query, (x, y, w, h) is no longer used directly. Instead, N/2 coordinate points (the initial control points, i.e., prior points) are uniformly sampled along the top and bottom edges of each axis-aligned box. For a box with center (x, y) and size (w, h), the top-edge points can be written as

point_{k} = \left(x - \frac{w}{2} + \frac{(k-1)\,w}{N/2 - 1},\; y - \frac{h}{2}\right), \quad k = 1, ..., N/2

with the bottom-edge points sampled analogously along y + h/2, so that the N points trace the box boundary.

(point_{1}, ..., point_{N}) is the position prior, which gives each content query an explicit position. The complete explicit-point formula for generating a compound query is:

Q^{i} = P^{i} + C = \varphi'((point_{1}, ..., point_{N})^{(i)}) + (p_{1}, ..., p_{N})

In this way, each of the N control-point content queries has its own explicit position prior.
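The sampling step above can be sketched in numpy as follows. The edge ordering (top edge left-to-right, bottom edge right-to-left, forming a closed ring) is an assumption about the point order; the function name is illustrative:

```python
import numpy as np

def sample_prior_points(boxes, num_points=16):
    """Uniformly sample num_points prior points per axis-aligned box:
    num_points/2 along the top edge (left -> right) and num_points/2
    along the bottom edge (right -> left), tracing a closed ring.

    boxes: (K, 4) array of (cx, cy, w, h); returns (K, num_points, 2).
    """
    cx, cy, w, h = boxes[:, 0:1], boxes[:, 1:2], boxes[:, 2:3], boxes[:, 3:4]
    t = np.linspace(0.0, 1.0, num_points // 2)       # (N/2,) in [0, 1]
    top_x = cx - w / 2 + t * w                       # left -> right
    top_y = np.broadcast_to(cy - h / 2, top_x.shape)
    bot_x = cx + w / 2 - t * w                       # right -> left
    bot_y = np.broadcast_to(cy + h / 2, bot_x.shape)
    top = np.stack([top_x, top_y], axis=-1)          # (K, N/2, 2)
    bot = np.stack([bot_x, bot_y], axis=-1)          # (K, N/2, 2)
    return np.concatenate([top, bot], axis=1)        # (K, N, 2)

pts = sample_prior_points(np.array([[0.5, 0.5, 0.4, 0.2]]), num_points=16)
print(pts.shape)              # (1, 16, 2)
print(pts[0, 0], pts[0, 8])   # top-left corner, bottom-right corner
```

Each of the N sampled points is then encoded (e.g., by a \varphi'-style sinusoidal embedding) to give every content query its own positional prior, instead of one shared box embedding.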

The differences:

\bullet In TESTR's compound query Q^{i}, each control-point content query p_{i} is added to the same embedding of (x, y, w, h); in this paper, each p_{i} is added to the embedding of its own point_{i}.

\bullet TESTR generates positional queries from the axis-aligned box (center and size), which is harder to refine because there is only a single center point.

\bullet DPText-DETR generates positional queries from explicit point coordinates (the N points sampled on the top and bottom edges). This is easy to refine: there are N explicit points, and the prediction head can predict a coordinate offset for each of them.

b. Point update.

The explicit point coordinates are refined layer by layer in the decoder, and the updated explicit points are sent to the deformable cross-attention module as new reference points.
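A sketch of one such layer-wise update, under the assumption that DPText-DETR follows the usual deformable-DETR convention of updating normalized reference points in logit (inverse-sigmoid) space; the helper names are illustrative:

```python
import numpy as np

def inverse_sigmoid(x, eps=1e-5):
    """Logit transform of normalized coordinates, clipped for stability."""
    x = np.clip(x, eps, 1 - eps)
    return np.log(x / (1 - x))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def refine_points(ref_points, offsets):
    """One decoder-layer point update: add the predicted offset in logit
    space, then squash back into the normalized [0, 1] image frame.

    ref_points: (K, N, 2) normalized coordinates; offsets: (K, N, 2).
    """
    return sigmoid(inverse_sigmoid(ref_points) + offsets)

ref = np.full((1, 16, 2), 0.5)         # initial prior points
off = np.zeros((1, 16, 2))             # hypothetical head output
print(refine_points(ref, off)[0, 0])   # zero offset leaves [0.5 0.5]
```

The output of each layer becomes the reference points for the next layer's deformable cross-attention, so the polygon estimate tightens progressively.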

Enhanced Factorized Self-Attention (EFSA)

In the decoder, once the queries are formed, one needs to mine the relationships among them. In prior work, self-attention was first applied across the points within the same instance to model intra-instance relationships, and then across the instance dimension to model inter-instance relationships. Although this relationship modeling (called Factorized Self-Attention, FSA) covers both intra- and inter-instance relations, it lacks explicit modeling of the spatial inductive bias of the control points within an instance. For polygon text representations, the control points clearly form a closed ring. The authors therefore introduce a circular convolution in parallel with the intra-instance self-attention to provide explicit circular guidance, adding a prior that helps fully mine the relationships among the control-point queries within an instance. The enhanced intra-instance modeling together with the inter-instance modeling constitutes the EFSA module.

1. Intra-group self-attention (SA_{intra}) over the N sub-queries (p_{1}, ..., p_{N}) of each compound query Q^{(i)} captures the relationships between different points within the same text instance.

2. Inter-group self-attention (SA_{inter}) over the K compound queries captures the relationships between different text instances.

The authors argue that the non-local SA_{intra} is insufficient for capturing the circular prior of polygon control points. Therefore, a local circular convolution is used to supplement FSA, forming EFSA.

Execute SA_{intra} to get Q_{intra} = SA_{intra}(Q).

Locally enhanced query: Q_{local} = ReLU(BN(CirConv(Q))).

Fused query: Q_{fuse} = LN(FC(C + LN(Q_{intra} + Q_{local}))), where FC is a fully connected layer, BN is BatchNorm, LN is LayerNorm, and C is the content query.

Execute SA_{inter} to mine relationships between different instances: Q_{inter} = SA_{inter}(Q_{fuse}).
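The circular (wrap-around) neighborhood that CirConv exploits can be illustrated with a minimal numpy sketch. The real model uses a learnable convolution over query channels; here a fixed 1-D kernel is shared across channels purely to show the ring topology (function and kernel are illustrative):

```python
import numpy as np

def circular_conv1d(queries, kernel):
    """Depthwise circular 1-D convolution over the N control-point
    queries of one instance, treating the points as a closed ring.

    queries: (N, D) point queries; kernel: (ksize,) shared 1-D kernel.
    """
    ksize = kernel.shape[0]
    half = ksize // 2
    out = np.zeros_like(queries, dtype=float)
    for shift, w in zip(range(-half, half + 1), kernel):
        # np.roll wraps around, so point 0 and point N-1 are neighbors
        out += w * np.roll(queries, -shift, axis=0)
    return out

q = np.eye(4)                       # 4 points with one-hot features
k = np.array([0.25, 0.5, 0.25])     # simple smoothing kernel
print(circular_conv1d(q, k)[0])     # mixes point 0 with points 3 and 1
```

Unlike self-attention, which relates every point to every other point globally, this local operator bakes in the prior that each control point's nearest neighbors on the polygon ring matter most, including across the "seam" between the first and last point.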

Origin blog.csdn.net/qq_44950283/article/details/135139510