ICCV2021: TextBPN - "Adaptive Boundary Proposal Network for Arbitrary Shape Text Detection" paper reading notes

Original link: https://arxiv.org/abs/2107.12664

Source link: https://github.com/GXYM/TextBPN

Foreword

For arbitrary-shape text detection in natural scene images, segmentation-based methods still face two problems: first, adjacent text instances cannot be effectively separated without complex post-processing; second, such methods depend on the accuracy of contour detection, so the detected contours often contain defects and noise. This paper therefore proposes an adaptive boundary proposal network for arbitrary shape text detection: the authors first obtain a coarse boundary for each text instance (slightly smaller than the true text region) to avoid adjacent instances sticking together, and then design an adaptive boundary deformation network that iteratively refines the coarse boundary until it approaches the true boundary.


1. Method design

1. Network structure

 Figure 1 TextBPN network structure diagram

The network includes three parts: an FPN-like feature pyramid built on a ResNet-50 backbone that produces the shared feature Fs (this structure is not shown in Figure 1), a boundary proposal network, and an adaptive boundary deformation network:

1) Multi-layer feature fusion strategy: multi-level convolutional features of the backbone are fused into Fs through upsampling and concatenation;

2) Boundary proposal module: composed of multi-layer dilated convolutions, namely two 3 × 3 convolution layers with different dilation rates and one 1 × 1 convolution layer, which generate the classification map, distance field map and direction field map, i.e. the prior information Fp;

3) Adaptive boundary deformation network: learns the boundary topology and sequence context through a GCN and an RNN, and iteratively refines the coarse boundary.

(1) Multi-layer feature fusion strategy

The deeper features are upsampled to the size of the shallower features and concatenated. The specific structure of this module is shown in Figure 2, with a code sketch after the figure:

Figure 2 FPN-like feature fusion structure
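As a rough illustration, the fusion step can be sketched in PyTorch as below. This is a minimal sketch, assuming ResNet-50 stage channels of 256/512/1024/2048 and a 32-D output to match the shared feature Fs mentioned later; the repo's actual layer layout may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureFusion(nn.Module):
    """FPN-like fusion: reduce each backbone stage, upsample the deeper
    stages to the shallowest resolution, concatenate, project to 32-D."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), width=32):
        super().__init__()
        self.reduce = nn.ModuleList(
            nn.Conv2d(c, width, kernel_size=1) for c in in_channels)
        self.out = nn.Conv2d(width * len(in_channels), 32, kernel_size=1)

    def forward(self, c2, c3, c4, c5):
        feats = [r(x) for r, x in zip(self.reduce, (c2, c3, c4, c5))]
        h, w = feats[0].shape[-2:]
        feats = [F.interpolate(f, size=(h, w), mode='bilinear',
                               align_corners=False) for f in feats]
        return self.out(torch.cat(feats, dim=1))  # shared feature Fs
```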

(2) Boundary proposal module

The classification map, distance field map and direction field map are obtained through multi-layer dilated convolution, as shown in Figure 3.
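For illustration, such a prediction head might look like the following sketch; the dilation rates, hidden width, and the 4-channel output layout (1 classification + 1 distance + 2 direction channels, matching the 4-D prior Fp mentioned later) are assumptions, not the repo's exact code.

```python
import torch.nn as nn

class BoundaryProposalHead(nn.Module):
    """Multi-layer dilated-convolution head: two 3x3 convs with
    different dilation rates plus a 1x1 conv producing the prior maps."""
    def __init__(self, c_in=32, c_hid=64):
        super().__init__()
        self.head = nn.Sequential(
            nn.Conv2d(c_in, c_hid, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(c_hid, c_hid, 3, padding=4, dilation=4), nn.ReLU(),
            nn.Conv2d(c_hid, 4, 1),  # Fp: cls + distance + direction (x, y)
        )

    def forward(self, fs):
        return self.head(fs)
```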

The classification map contains the classification confidence of each pixel (text / non-text).

The direction field map V consists of a two-dimensional unit vector per pixel, shown in Figure 3(c), representing the direction from each text pixel to the nearest pixel on the text boundary: for each pixel p in a text instance T, the nearest boundary pixel B_p is found on the text contour and a unit vector V_gt(p) is computed; pixels outside the text instance are set to (0, 0) in the direction field:

V_{gt}(p)=\begin{cases}\overrightarrow{B_{p}p}/\left|\overrightarrow{B_{p}p}\right|, & p\in\mathbb{T}\\ (0,0), & p\notin\mathbb{T}\end{cases}

The distance field map D is a normalized distance map: for each text pixel p in an instance T, D_gt(p) is the distance from p to its nearest boundary pixel B_p, normalized by L, the scale of the text instance containing p. Pixels outside the text instance are set to 0 in the distance field:

D_{gt}(p)=\begin{cases}\left|\overrightarrow{B_{p}p}\right|/L, & p\in\mathbb{T}\\ 0, & p\notin\mathbb{T}\end{cases}\qquad(1)

L=\max_{p\in\mathbb{T}}\left|\overrightarrow{B_{p}p}\right|\qquad(2)

Figure 3 Visualization of the prior information maps
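For concreteness, the ground-truth fields of Eqs. (1) and (2) can be generated per instance with a Euclidean distance transform. This is only a sketch of my understanding; the helper name and the use of scipy are my own choices, not the repo's code.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def build_field_gt(text_mask):
    """Ground-truth distance/direction fields for one text instance.
    text_mask: (H, W) bool array, True inside the instance T."""
    # Distance of each inside pixel to the nearest background pixel
    # (approximately the nearest boundary pixel B_p), with its indices.
    dist, inds = distance_transform_edt(text_mask, return_indices=True)
    iy, ix = inds
    L = dist.max() + 1e-6                        # instance scale, Eq. (2)
    D_gt = np.where(text_mask, dist / L, 0.0)    # normalized distance, Eq. (1)

    ys, xs = np.indices(text_mask.shape)
    vec = np.stack([xs - ix, ys - iy]).astype(np.float32)  # vector B_p -> p
    norm = np.maximum(np.linalg.norm(vec, axis=0), 1e-6)
    V_gt = vec / norm                            # unit direction vectors
    V_gt[:, ~text_mask] = 0.0                    # (0, 0) outside the instance
    return D_gt, V_gt
```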

In the boundary proposal module, candidate boundary proposals are generated by thresholding the distance field map D with a fixed threshold th_d. In Figure 4, the original image (a) yields possible text regions through the distance field map, but some of them are false detections, as shown in (b). The average confidence of each candidate is then computed from the classification map, and a proposal is discarded when its score falls below the confidence threshold th_s, which yields the final set of proposed text boundaries (a code sketch follows the figure).

Figure 4 Pipeline for generating text boundary proposals
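A hedged sketch of this proposal step is below; the threshold values th_d and th_s are placeholders rather than the paper's tuned values.

```python
import numpy as np
from skimage import measure

def generate_proposals(dist_field, cls_map, th_d=0.3, th_s=0.85):
    """Threshold the distance field, take connected components as
    candidates, keep those whose mean classification score >= th_s."""
    labels = measure.label(dist_field > th_d, connectivity=2)
    proposals = []
    for region in measure.regionprops(labels):
        inst = labels == region.label
        if cls_map[inst].mean() >= th_s:   # average confidence of the region
            proposals.append(inst)         # keep the region mask as a proposal
    return proposals
```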

(3) Adaptive boundary deformation module

The main function of this module is to learn from the topological structure and sequence context within each boundary and to iteratively refine the coarse boundaries into accurate text instance boundaries (my personal understanding is that it plays a role similar to post-processing). The encoder introduces both a GCN and an RNN, along with a 1 × 1 convolution branch that forms a ResNet-like residual connection, as shown in Figure 5 (a code sketch follows the figure). The decoder consists of three 1 × 1 convolution layers with ReLU activations. To refine the proposals, the paper applies the module iteratively (in the source code the module is applied in a loop three times).

Figure 5 Structure of the adaptive deformation module
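The encoder/decoder might look roughly like the sketch below. The hidden sizes, the bidirectional LSTM for the RNN branch, and the simple A·X·W graph convolution are illustrative assumptions; the repo's implementation differs in details.

```python
import torch
import torch.nn as nn

class AdaptiveDeformationBlock(nn.Module):
    """Encoder: GCN branch + RNN branch + 1x1-conv residual shortcut.
    Decoder: three 1x1 convs with ReLU, predicting per-point offsets."""
    def __init__(self, c_in=36, c_hid=128):
        super().__init__()
        self.rnn = nn.LSTM(c_in, c_hid // 2, bidirectional=True,
                           batch_first=True)
        self.gcn_w = nn.Linear(c_in, c_hid)       # one GCN layer: A X W
        self.shortcut = nn.Conv1d(c_in, c_hid, kernel_size=1)
        self.decoder = nn.Sequential(
            nn.Conv1d(3 * c_hid, c_hid, 1), nn.ReLU(),
            nn.Conv1d(c_hid, c_hid, 1), nn.ReLU(),
            nn.Conv1d(c_hid, 2, 1),                # (dx, dy) per control point
        )

    def forward(self, x, adj):
        # x: (B, N, C) point features; adj: (N, N) cycle-graph adjacency
        rnn_out, _ = self.rnn(x)                   # sequence context
        gcn_out = torch.relu(self.gcn_w(adj @ x))  # boundary topology
        res = self.shortcut(x.transpose(1, 2)).transpose(1, 2)
        h = torch.cat([rnn_out, gcn_out, res], dim=-1).transpose(1, 2)
        return self.decoder(h).transpose(1, 2)     # (B, N, 2) offsets
```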

After obtaining a candidate proposal, its control points must be extracted. The paper takes the boundary of the proposal, divides it into 20 segments of equal length along the perimeter, and obtains 20 coordinate points as the proposal's control points, as sketched below. (In the source code's training, the 20 control points sampled from each generated proposal are matched against the labeled text boxes for iterative training.)
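Sampling 20 equally spaced points along a proposal's perimeter is a simple arc-length interpolation; the helper below is illustrative, not the repo's code.

```python
import numpy as np

def sample_control_points(polygon, n=20):
    """Resample a closed polygon into n points spaced equally along
    its perimeter. polygon: (M, 2) array of boundary vertices."""
    pts = np.asarray(polygon, dtype=np.float32)
    closed = np.vstack([pts, pts[:1]])             # close the contour
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)
    cum = np.concatenate([[0.0], np.cumsum(seg)])  # arc length per vertex
    targets = np.linspace(0.0, cum[-1], n, endpoint=False)
    xs = np.interp(targets, cum, closed[:, 0])     # interpolate x along arc
    ys = np.interp(targets, cum, closed[:, 1])     # interpolate y along arc
    return np.stack([xs, ys], axis=1)              # (n, 2) control points
```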

As shown in Figure 6, a node feature matrix must be built from the control points as the input to the adaptive deformation module. Specifically, as seen in Figure 2, the 32-D shared feature Fs from the CNN backbone and the 4-D prior feature Fp from the multi-layer dilated convolutions are concatenated into cnn_feature, i.e. F. The feature of each control point (x_i, y_i) is then extracted from the corresponding position in F as f_i = concat(Fs(x_i, y_i), Fp(x_i, y_i)), and together these give the candidate boundary feature matrix X (size N × C); see the sketch after the figure.

Figure 6 Overview of the adaptive boundary deformation network
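A sketch of building X with bilinear sampling (PyTorch's grid_sample); here Fs is taken as 32-D and Fp as 4-D per the text, so C = 36. This illustrates the idea rather than reproducing the repo's exact code.

```python
import torch
import torch.nn.functional as F

def extract_point_features(fs, fp, points):
    """Sample F = concat(Fs, Fp) at each control point to build X.
    fs: (1, 32, H, W); fp: (1, 4, H, W); points: (N, 2) pixel coords (x, y)."""
    feat = torch.cat([fs, fp], dim=1)              # F with C = 36 channels
    h, w = feat.shape[-2:]
    grid = points.clone().float()
    grid[:, 0] = 2 * grid[:, 0] / (w - 1) - 1      # x -> [-1, 1]
    grid[:, 1] = 2 * grid[:, 1] / (h - 1) - 1      # y -> [-1, 1]
    grid = grid.view(1, 1, -1, 2)                  # (1, 1, N, 2)
    sampled = F.grid_sample(feat, grid, align_corners=True)  # (1, C, 1, N)
    return sampled[0, :, 0, :].t()                 # X: (N, C)
```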

2. Loss function

The loss function of the network is defined as

\mathcal{L}=\mathcal{L}_{B_{p}}+\frac{\lambda\cdot\mathcal{L}_{B_{d}}}{1+e^{(i-eps)/eps}}

where L_Bp is the boundary proposal loss, L_Bd is the loss of the adaptive boundary deformation model, i is the current epoch, eps is the maximum number of training epochs, and λ is set to 0.1.
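For concreteness, this epoch-dependent weight on L_Bd works out to about 0.73·λ at epoch 0 and 0.5·λ at the final epoch; a tiny helper (the function name is mine):

```python
import math

def deformation_weight(i, eps, lam=0.1):
    """Weight applied to L_Bd at epoch i, per the formula above."""
    return lam / (1.0 + math.exp((i - eps) / eps))
```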

The boundary proposal loss L_Bp consists of the cross-entropy pixel classification loss L_cls, the distance field regression loss L_D, and the direction field loss L_V, which combines an L2-norm distance and an angular distance, with α = 3:

\mathcal{L}_{B_{p}}=\mathcal{L}_{cls}+\alpha\cdot\mathcal{L}_{D}+\mathcal{L}_{V}

L_Bd is the point matching loss, which measures the error between the predicted points and the ground-truth points. The loss of a single text instance is L(p, p'); since an image can contain multiple (say N) text instances, the losses are averaged:

\mathcal{L}_{B_{d}}=\frac{1}{N}\sum_{i=0}^{N-1}\mathcal{L}\left(p_{i},p_{i}'\right)
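As an illustration only (the paper's exact matching scheme may differ), an L1 point matching loss averaged over instances could look like:

```python
import torch

def point_matching_loss(pred, gt):
    """L1 distance between matched point pairs, averaged over points.
    A real implementation would first fix the correspondence, e.g. by
    picking the cyclic shift of gt minimizing the total distance.
    pred, gt: (N, 2) tensors of boundary points."""
    return torch.abs(pred - gt).sum(dim=1).mean()

def deformation_loss(pred_insts, gt_insts):
    """Average the per-instance matching loss over all text instances."""
    losses = [point_matching_loss(p, g) for p, g in zip(pred_insts, gt_insts)]
    return torch.stack(losses).mean()
```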


2. Experimental results

1. Ablation experiment

(1) Adaptive candidate frame deformation module

Experiments were carried out on Total-Text and CTW1500 using four different encoder types: FC (1 × 1 convolution), RNN, circular convolution, and graph convolution (GCN). The results are shown in Table 1; the proposed adaptive deformation network, which combines GCN and RNN, works best.

(2) Number of control points

This experiment explores how many control points should be used for the proposals. The number of control points is varied from 12 to 32 in steps of 4, again evaluated on Total-Text and CTW1500. As shown in Figure 7, the best results are obtained with about 20 control points, so the paper also sets the number to 20.

Figure 7 Experimental results of the number of control points

 (3) The influence of the number of iterations

To fully verify the influence of the number of iterations, the author evaluated models under different iteration counts; the results are shown in Table 2. As the number of iterations increases, detection improves but inference slows down, and beyond 3 iterations the improvement is no longer obvious. To balance speed and performance, the author sets the default number of iterations to 3.

The author also shows the predicted boundaries during the iteration process: the blue boundary is the proposal (coarse boundary) and the green ones are the predictions from each iteration, as shown in Figure 8.

 Figure 8 Visual display of iteration results

(4) Influence of prior information

In the boundary proposal module, the classification map, distance field and direction field are generated as prior information to guide the iterative deformation performed by the adaptive boundary deformation module. The results in Table 3 show that the added prior information contributes significantly to performance.

 (5) Different FPN resolutions

This experiment tests FPN-P1 (1/1), FPN-P1 (1/2) and FPN-P2 (1/4), which denote, respectively, the P1 feature upsampled to the original image size, the P1 feature without upsampling (1/2 of the original size), and the P2 feature (1/4 of the original size). The results are shown in Table 4.

2. Performance comparison

Total-Text

 CTW-1500

MSRA-TD500






Summary

This paper proposes an adaptive boundary proposal network for arbitrary shape text detection. A boundary proposal model generates coarse boundaries, and an adaptive boundary deformation model combining GCN and RNN then deforms them iteratively, refining the coarse boundaries into more accurate text instance shapes.

The above is only a summary of my own experience reading the paper, and some of it may not be accurate. If there are mistakes, criticism and discussion are welcome.

Discussion of reading and understanding the source code is also welcome, so that we can help each other and make progress together.
