Classic Text Detection Models Explained: SegLink

In natural scenes, such as light-box billboards, product packaging, and trademarks, text detection faces various complications such as tilted angles and deformation, so deep-learning-based methods are needed. The previous article introduced CTPN, a text detection method based on convolutional and recurrent neural networks (see the article "Classic Text Detection Models Explained: CTPN"). CTPN detects text in natural scenes well, but its detections are horizontal, so it performs poorly on non-horizontal text. In natural scenes, a lot of text appears at some rotation angle, for example in a mobile-phone photo of a street sign, as shown below. If the detection result is horizontal only, with no angle information, the sign in the following figure is detected as the red box, whereas the green box is the ideal detection target. The error of the horizontal result is clearly too large.

 

How can we detect text flexibly at various angles? One of the most direct ideas is to let the model learn and output not only the position of the bounding box (x, y, w, h) but also the rotation angle θ of the text box. The text detection model SegLink introduced in this article adopts this idea, so it can detect text with a rotation angle, as shown in the following figure:

1. The main idea of the SegLink model

The detection process of the SegLink model is mainly as follows:

1. First, detect and generate the segments (slices) one by one, shown as the yellow boxes in the figure above. Each segment is a part of a text line (or word); it may be a single character, a word, or several characters.

2. Connect the segments belonging to the same text line (or word) through links, shown as the green lines in the figure above. A link connects the center points of two adjacent, overlapping segments, as shown in the figure below.

3. Through the merging algorithm, the segments and links are merged into a complete text line, which yields the position and rotation angle of the detection box for the complete text line.
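The final output of these steps is a box position plus a rotation angle. As a minimal sketch of what that representation means geometrically, the following converts a rotated box (x, y, w, h, θ) into its four corner points. The function name and the convention that θ is given in radians and rotates the box about its center are assumptions made for illustration, not details taken from the article.

```python
import math

def rotated_box_corners(cx, cy, w, h, theta):
    """Return the four corners of a box centred at (cx, cy) with width w and
    height h, rotated by theta radians about its centre (assumed convention)."""
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    corners = []
    # Corner offsets in the box's own, unrotated frame.
    for dx, dy in ((-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)):
        # Standard 2-D rotation, then shift to the box centre.
        x = cx + dx * cos_t - dy * sin_t
        y = cy + dx * sin_t + dy * cos_t
        corners.append((x, y))
    return corners
```

With θ = 0 this reduces to an ordinary axis-aligned box, which is the only case a horizontal detector can express.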

 

Segments (slices) and links are the innovations of the SegLink model. The model learns not only the location of each segment but also the link relationships between segments, which indicate whether they belong to the same text line (or word).
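Once the links have been predicted, segments joined by positive links can be grouped into text lines by finding connected components. The sketch below uses a union-find structure for this; the function name `group_segments` and the input format (segment indices plus a list of linked index pairs) are hypothetical choices for illustration, and a depth-first search over the same graph would serve equally well.

```python
def group_segments(num_segments, links):
    """Group segment indices into text lines.

    links: iterable of (i, j) pairs whose link was predicted as positive.
    Returns a list of index lists, one per connected component.
    """
    parent = list(range(num_segments))

    def find(i):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in links:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_j] = root_i  # merge the two groups

    groups = {}
    for i in range(num_segments):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

For example, `group_segments(5, [(0, 1), (1, 2), (3, 4)])` puts segments 0, 1, 2 in one text line and segments 3, 4 in another.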

 

2. The network structure of the SegLink model

The network structure of the SegLink model is as follows:

The model uses VGG16 as the backbone network and replaces its fully connected layers (fc6, fc7) with convolutional layers (conv6, conv7), followed by four more convolutional layers (conv8, conv9, conv10, conv11). The feature maps of six layers, conv4_3, conv7, conv8_2, conv9_2, conv10_2, and conv11, are taken out and convolved to predict the segments and links. These six feature maps have different sizes, each one half the size of the previous one. Predicting segments and links from six layers of different sizes makes it possible to detect text of different sizes (large feature maps are good at detecting small objects, and small feature maps are good at detecting large objects).
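To make the halving concrete, here is a small sketch that lists the side length of each of the six feature maps for a square input. It assumes conv4_3 has a stride of 8 relative to the input (it sits after three 2x2 poolings in VGG16) and uses simple integer halving; a real implementation may round differently depending on padding, so treat the numbers as illustrative.

```python
LAYERS = ["conv4_3", "conv7", "conv8_2", "conv9_2", "conv10_2", "conv11"]

def feature_map_sizes(input_size, first_stride=8):
    """Side length of each feature map for a square input image.

    first_stride=8 is an assumption about conv4_3 in VGG16.
    """
    sizes = {}
    size = input_size // first_stride
    for name in LAYERS:
        sizes[name] = size
        size = max(size // 2, 1)  # each layer is half the size of the previous one
    return sizes
```

Under these assumptions a 512x512 input gives maps of side 64, 32, 16, 8, 4, and 2, so conv4_3 has the finest grid for small text and conv11 the coarsest grid for large text.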

1. Segment detection

The whole architecture follows the idea of SSD. Segment detection works much like detection in SSD: the results are regressed relative to default boxes, and each feature map is convolved to produce an output with 7 channels. Two of the channels are the scores for whether the segment is text or not, giving a confidence value in (0, 1), and the remaining five are the offsets of the segment relative to the default box at the corresponding position. Each segment is represented by its center, width, height, and rotation angle: s = (x_s, y_s, w_s, h_s, θ_s).
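As a rough illustration of how the seven output channels at one location might be turned into a scored, oriented segment, consider the sketch below. The channel order, the `scale` parameter, and the decoding formulas (linear offsets for the center, exponential for width and height, the angle taken directly) are assumptions in the style of SSD-like detectors, not details confirmed by this article; consult the paper for the exact definitions.

```python
import math

def decode_segment(pred, default_center, scale):
    """Decode one 7-channel prediction into (score, (x, y, w, h, theta)).

    pred: [neg_score, pos_score, dx, dy, dw, dh, dtheta]  (assumed channel order)
    default_center: (xa, ya), centre of the default box in image coordinates
    scale: hypothetical per-layer scale of the default box
    """
    neg, pos, dx, dy, dw, dh, dtheta = pred
    # Softmax over the two score channels gives a text confidence in (0, 1).
    m = max(neg, pos)
    e_neg, e_pos = math.exp(neg - m), math.exp(pos - m)
    score = e_pos / (e_neg + e_pos)
    xa, ya = default_center
    x = scale * dx + xa          # centre offsets are scaled and shifted
    y = scale * dy + ya
    w = scale * math.exp(dw)     # exponential keeps width and height positive
    h = scale * math.exp(dh)
    theta = dtheta               # angle is regressed directly
    return score, (x, y, w, h, theta)
```

A prediction of all zeros therefore decodes to a confidence of 0.5 and a square, unrotated box of side `scale` sitting exactly on the default box center.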

2. Link detection

Links between segments come in two kinds: within-layer links and cross-layer links, as shown below:

Within-layer link detection predicts the connection between each segment and the segments in its 8-neighborhood on the same feature layer. Each link has two scores, a positive score and a negative score: the positive score indicates that the two segments belong to the same text (and should be connected), and the negative score indicates that they belong to different texts (and should be disconnected). Cross-layer link detection mainly addresses the problem that segments of the same text are detected on different layers, which causes repeated, redundant detections. Besides its neighbors on its own layer, a segment also has neighbors on the previous layer, but segments on the later layer are not treated as neighbors of the previous layer. This redundancy is eliminated in the subsequent merging algorithm.
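The two neighborhoods can be sketched as index computations on the feature-map grid. The within-layer part follows the 8-neighborhood described above. The cross-layer part assumes that, because each feature map is half the size of the previous one, a location (x, y) corresponds to a 2x2 block on the previous layer; that mapping and both function names are illustrative assumptions.

```python
def within_layer_neighbors(x, y, width, height):
    """The up-to-8 neighbours of location (x, y) on the same feature map."""
    return [
        (x + dx, y + dy)
        for dy in (-1, 0, 1)
        for dx in (-1, 0, 1)
        if (dx, dy) != (0, 0)
        and 0 <= x + dx < width
        and 0 <= y + dy < height
    ]

def cross_layer_neighbors(x, y):
    """Locations on the previous (twice as large) feature map that (x, y) covers.

    Assumes an exact 2x size ratio between adjacent feature layers.
    """
    return [(2 * x + dx, 2 * y + dy) for dy in (0, 1) for dx in (0, 1)]
```

An interior location has 8 within-layer neighbors, a corner location only 3, and under the stated assumption every location has 4 cross-layer neighbors on the previous layer.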

3. Merge algorithm

The idea of the merge algorithm is as follows:

  • Take the segments that belong to the same text line
  • Fit a straight line to the center points of these segments by least-squares linear regression
  • Project the center point of each segment perpendicularly onto this line
  • Among all projected points, take the two that are farthest apart, denoted (xp, yp) and (xq, yq)
  • The final merged text box then has (1) center ((xp + xq)/2, (yp + yq)/2), (2) width equal to the distance between the two farthest points (xp, yp) and (xq, yq) plus half the width of each of the two corresponding segments (Wp/2 + Wq/2), and (3) height equal to the average height of all segments
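The steps above can be sketched in a few lines of Python. This is a minimal illustration of the procedure as listed here, not the authors' implementation: the function name, the (x, y, w, h) input tuples, and the fallbacks for a vertical line or a single segment are assumptions added so the code runs on any input.

```python
import math

def merge_segments(segments):
    """Merge the segments (x, y, w, h) of one text line into (cx, cy, w, h, theta)."""
    n = len(segments)
    xs = [s[0] for s in segments]
    ys = [s[1] for s in segments]
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx > 0:
        theta = math.atan2(sxy, sxx)   # angle of the least-squares line y = a*x + b
    else:
        # Assumed fallback: vertical line if the centres differ in y, else no rotation.
        theta = math.pi / 2 if syy > 0 else 0.0
    ux, uy = math.cos(theta), math.sin(theta)  # unit vector along the fitted line
    # Signed position of each centre's perpendicular projection along the line.
    ts = [(x - mean_x) * ux + (y - mean_y) * uy for x, y in zip(xs, ys)]
    p = min(range(n), key=ts.__getitem__)
    q = max(range(n), key=ts.__getitem__)
    xp, yp = mean_x + ts[p] * ux, mean_y + ts[p] * uy
    xq, yq = mean_x + ts[q] * ux, mean_y + ts[q] * uy
    width = (ts[q] - ts[p]) + segments[p][2] / 2 + segments[q][2] / 2
    height = sum(s[3] for s in segments) / n
    return ((xp + xq) / 2, (yp + yq) / 2, width, height, theta)
```

For three segments centered at (0, 0), (2, 0), and (4, 0), each of width 2, the merged box is centered at (2, 0) with width 4 + 1 + 1 = 6 and zero rotation.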

 

As shown in the figure below, the orange line in the middle is the straight line obtained by least-squares regression, the red points are the center points of the segments, the yellow points are the perpendicular projections of the red points onto the line, and the green box is the complete text box produced by the merging algorithm above.

 

3. Summary

SegLink adds angle detection, which makes it robust to text at various angles, whereas CTPN is mainly used to detect horizontal text lines, as shown in the following figure:

However, this model also has shortcomings. For example, it cannot detect text lines with large gaps, because links only connect adjacent segments, so the result is poor when pieces of text are too far apart. In addition, it cannot detect deformed or curved text, because the final merge algorithm uses linear regression, which can only fit straight lines and not curves. It is possible, however, to modify the merge algorithm so that curved text can be detected.

 

Strongly recommended

In 2017, Baoguang Shi et al. published the classic SegLink paper "Detecting Oriented Text in Natural Images by Linking Segments", which describes the technical principles of SegLink in detail. Reading the paper is recommended for a deeper understanding of the model.

 

Follow my official account "Big Data and Artificial Intelligence Lab" (BigdataAILab), and then reply with the keyword "thesis" to read classic papers online.

 

