Introduction to the EAST Text Detection Algorithm

EAST ("EAST: An Efficient and Accurate Scene Text Detector") is a paper on text detection in natural scenes, published by Megvii (Face++) at CVPR 2017. It addresses multi-oriented text detection, and its core ideas are reflected in the following points.

  1. Like FCN, it uses a fully convolutional network with multi-scale feature fusion to extract features for the subsequent pixel-level prediction of text regions.
  2. EAST detects oblique text directly and is suitable for text detection in natural scenes. It supports two kinds of annotation for text regions: a rotated rectangle (RBOX) and an arbitrary quadrilateral (QUAD). In other words, when regressing a text region, EAST offers two different geometry outputs: a rotated rectangle, described by an axis-aligned rectangle plus a rotation angle, or an arbitrary quadrilateral.
  3. Because orientation information is modeled, text in arbitrary orientations can be detected.
  4. Because of its limited receptive field, it does not perform well on long text.
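
Point 2 can be made concrete with a small sketch of how one pixel's RBOX output (four distances plus an angle) could be turned back into a rotated rectangle. The function name and the rotation convention (the standard 2D rotation matrix applied about the pixel, in image coordinates with y pointing down) are assumptions for illustration, not the paper's reference code.

```python
import math


def rbox_to_corners(x, y, top, right, bottom, left, theta):
    """Recover the 4 corners of a rotated rectangle from one pixel's RBOX output.

    Hypothetical convention: top/right/bottom/left are distances from the
    pixel (x, y) to the rectangle's sides, measured in the rectangle's own
    frame; theta rotates that frame about the pixel with the matrix
    [[cos, -sin], [sin, cos]] (image coordinates, y pointing down).
    """
    # Corner offsets relative to the pixel before rotation: TL, TR, BR, BL.
    local = [(-left, -top), (right, -top), (right, bottom), (-left, bottom)]
    c, s = math.cos(theta), math.sin(theta)
    corners = []
    for dx, dy in local:
        # Rotate the offset by theta, then translate back to the pixel position.
        corners.append((x + c * dx - s * dy, y + s * dx + c * dy))
    return corners


print(rbox_to_corners(5, 5, 1, 2, 3, 4, 0.0))
# [(1.0, 4.0), (7.0, 4.0), (7.0, 8.0), (1.0, 8.0)]
```

With theta = 0 this reduces to an axis-aligned box; a nonzero angle rotates it about the pixel while keeping its width (left + right) and height (top + bottom).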

  • EAST model network structure

In the figure above, we can see that EAST mainly follows the idea of FPN to extract multi-scale fused features. The backbone extracts features from the original image down to the smallest feature map (7*7 in the figure). This map is then upsampled step by step, and after each upsampling it is merged with the shallower feature map of the same size, so that features of different scales are fused. The merge is done by concatenation. The resulting multi-scale features are used as the input of the output layer. Overall, this is an FPN-like structure. The output layer has two different outputs, corresponding to the rotated rectangle (RBOX: 1 score map + 4 distances to the box boundaries + 1 rotation angle) and the arbitrary quadrilateral (QUAD: 1 score map + 8 coordinate offsets).
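
To make the 7*7 example concrete, the sketch below traces spatial sizes through the merging branch, assuming a 224×224 input and backbone feature maps at strides 32, 16, 8 and 4 (the stride list and the function name are assumptions for illustration). It only does shape arithmetic; it does not build a network.

```python
def east_feature_shapes(input_size=224, strides=(32, 16, 8, 4)):
    """Trace spatial sizes through an FPN-style merging branch (shape arithmetic only).

    Assumption: the backbone yields square feature maps at these strides,
    deepest first; each merge step upsamples by 2 and concatenates the
    result with the next shallower map.
    """
    sizes = [input_size // s for s in strides]
    steps = []
    current = sizes[0]  # deepest map, e.g. 7 for a 224x224 input
    for skip in sizes[1:]:
        upsampled = current * 2
        # Concatenation along channels requires matching spatial sizes.
        if upsampled != skip:
            raise ValueError(f"cannot concat {upsampled}x{upsampled} with {skip}x{skip}")
        steps.append((current, upsampled))
        current = upsampled
    return sizes, steps, current


# Per-pixel output channels on the merged map, following the description of the output layer.
RBOX_CHANNELS = 1 + 4 + 1  # score + 4 boundary distances + rotation angle
QUAD_CHANNELS = 1 + 8      # score + 8 vertex coordinate offsets

sizes, steps, final = east_feature_shapes(224)
print(sizes, steps, final)  # [7, 14, 28, 56] [(7, 14), (14, 28), (28, 56)] 56
```

Under these assumed strides the merged map ends at 1/4 of the input resolution, and every pixel of it carries 6 RBOX or 9 QUAD output channels.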

  • EAST label generation

In the figure above, consider one text region in an image. The yellow quadrilateral is the original text annotation. The green quadrilateral is a shrunk version of it: each edge is moved inward by 0.3 times a reference length r_i, the shorter of the two edges meeting at vertex i. Pixels inside the green region are the positives of the score map (panel b), i.e. the text score feature map. Computing the minimum-area rectangle that covers the text region (the pink area in panel c) gives the RBOX labels: for each positive pixel, the distances to the four sides of that rectangle, plus the angle between the rectangle and the horizontal, which serves as the RBOX rotation angle. For the QUAD labels, the coordinate offsets from the pixel position to each of the four vertices of the text quadrilateral are computed, and these four (x, y) offsets form the 8-dimensional output. This yields both the RBOX and the QUAD targets used for the network's regression and prediction.
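
A minimal plain-Python sketch of this label geometry: shrinking the quadrilateral, rasterizing the score map, and computing the 8 QUAD offsets for one pixel. The function names, sampling pixel centres at (x + 0.5, y + 0.5), the convex point-in-polygon test, and deciding which pair of opposite edges is "longer" by summed length are assumptions for illustration. The RBOX targets, which need a minimum-area rectangle, are left out.

```python
import math


def shrink_quad(quad, ratio=0.3):
    """Shrink a quadrilateral for score-map labels (sketch).

    quad: four (x, y) vertices in order. r_i is the shorter of the two edges
    meeting at vertex i; each edge's endpoints are moved inward along that
    edge by ratio * r_i, processing the longer pair of opposite edges first.
    """
    pts = [list(p) for p in quad]
    # Reference lengths come from the original (unshrunk) quadrilateral.
    r = [min(math.dist(quad[i], quad[(i + 1) % 4]),
             math.dist(quad[i], quad[(i - 1) % 4])) for i in range(4)]
    pair_a = math.dist(quad[0], quad[1]) + math.dist(quad[2], quad[3])
    pair_b = math.dist(quad[1], quad[2]) + math.dist(quad[3], quad[0])
    order = [0, 2, 1, 3] if pair_a >= pair_b else [1, 3, 0, 2]
    for i in order:  # edge i runs from vertex i to vertex i + 1
        j = (i + 1) % 4
        dx, dy = pts[j][0] - pts[i][0], pts[j][1] - pts[i][1]
        length = math.hypot(dx, dy)
        ux, uy = dx / length, dy / length
        # Move both endpoints toward each other along the edge.
        pts[i][0] += ratio * r[i] * ux
        pts[i][1] += ratio * r[i] * uy
        pts[j][0] -= ratio * r[j] * ux
        pts[j][1] -= ratio * r[j] * uy
    return [tuple(p) for p in pts]


def point_in_quad(px, py, quad):
    """True if (px, py) lies inside or on a convex quadrilateral."""
    crosses = []
    for i in range(4):
        (x1, y1), (x2, y2) = quad[i], quad[(i + 1) % 4]
        crosses.append((x2 - x1) * (py - y1) - (y2 - y1) * (px - x1))
    return all(c >= 0 for c in crosses) or all(c <= 0 for c in crosses)


def score_map(quad, height, width):
    """Binary score map: 1 where the pixel centre falls inside the shrunk quad."""
    shrunk = shrink_quad(quad)
    return [[1 if point_in_quad(x + 0.5, y + 0.5, shrunk) else 0
             for x in range(width)] for y in range(height)]


def quad_offsets(px, py, quad):
    """QUAD target: (dx, dy) from the pixel to each vertex, flattened to 8 values."""
    return [c for (vx, vy) in quad for c in (vx - px, vy - py)]


box = [(0, 0), (10, 0), (10, 4), (0, 4)]
print(quad_offsets(5, 2, box))  # [-5, -2, 5, -2, 5, 2, -5, 2]
```

For the 10×4 box above, each reference length is 4, so every side moves inward by 1.2 and only the two middle rows of a 4×10 score map become positive.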

  • EAST loss function

The authors use a class-balanced cross-entropy (balanced-xent) loss for the score map, an IoU loss for the box, and an angle loss. These three terms are combined into the final loss.

Class-balanced cross-entropy addresses class imbalance during training, since text pixels are far fewer than background pixels. Here β is the ratio of the number of negative samples to the total number of samples, so the rare positive pixels receive the larger weight β. The IoU loss is computed from the IoU between the predicted rectangle and the ground-truth rectangle of the text region (the paper uses the negative logarithm of this IoU). The angle loss is based on the cosine of the angle difference, 1 - cos(θ̂ - θ*). Combining these three losses gives the loss function used to train the network.
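
A minimal pure-Python sketch of the three loss terms, assuming a flattened score map and the RBOX geometry of a single positive pixel. Because the predicted and ground-truth boxes are both described by distances from the same pixel, their intersection follows from element-wise minima. The weights lambda_theta and lambda_g (placeholder defaults here), averaging the score loss over pixels, and all function names are assumptions for illustration.

```python
import math


def balanced_bce(pred, gt, eps=1e-6):
    """Class-balanced cross-entropy, averaged over pixels (averaging is an assumption).

    beta = fraction of negative pixels, so the usually rare positives get weight beta.
    """
    beta = 1.0 - sum(gt) / len(gt)
    total = 0.0
    for p, y in zip(pred, gt):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -beta * y * math.log(p) - (1.0 - beta) * (1 - y) * math.log(1.0 - p)
    return total / len(gt)


def iou_loss(pred, gt):
    """-log IoU of two boxes given as (top, right, bottom, left) distances from one pixel."""
    t, r, b, l = pred
    ts, rs, bs, ls = gt
    area_pred = (t + b) * (r + l)
    area_gt = (ts + bs) * (rs + ls)
    # Both boxes contain the pixel, so the overlap extends min(...) on each side.
    inter = (min(r, rs) + min(l, ls)) * (min(t, ts) + min(b, bs))
    union = area_pred + area_gt - inter
    return -math.log(inter / union)


def angle_loss(theta_pred, theta_gt):
    """1 - cos of the angle difference: 0 when the angles agree."""
    return 1.0 - math.cos(theta_pred - theta_gt)


def east_rbox_loss(score_pred, score_gt, box_pred, box_gt, theta_pred, theta_gt,
                   lambda_theta=10.0, lambda_g=1.0):
    """Total loss L_s + lambda_g * (L_IoU + lambda_theta * L_theta) (weights are placeholders)."""
    l_s = balanced_bce(score_pred, score_gt)
    l_g = iou_loss(box_pred, box_gt) + lambda_theta * angle_loss(theta_pred, theta_gt)
    return l_s + lambda_g * l_g


print(iou_loss((2, 1, 2, 1), (1, 1, 1, 1)))  # -log(4 / 8) = log 2
```

In a real training loop these terms would be computed on tensors and the geometry loss would only be applied at positive pixels; the scalar version just makes each formula easy to check.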

During training, rather than relying on balanced sampling and hard example mining to handle the imbalanced distribution of targets, as many detectors do, the authors use the class-balanced cross-entropy above, which keeps the training pipeline simpler. After the final bounding boxes are obtained, the authors also improve the final NMS step and propose a locality-aware NMS strategy, which modifies standard NMS as follows.

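Geometries are processed row by row. If the IoU between the current box and the most recently merged box exceeds a threshold, the two boxes are merged: the coordinates of the merged box are the average of the two boxes' coordinates weighted by their scores, and the merged score is the sum of the two scores. Because each final box draws on more regression results, this strategy reduces the final error. It also speeds up result generation: neighboring pixels tend to predict heavily overlapping boxes, so merging them first leaves far fewer candidates for the standard NMS that runs afterwards, whose cost grows quadratically with the number of candidates.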
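
The locality-aware NMS steps above can be sketched as follows. To stay self-contained, this sketch uses axis-aligned boxes (x1, y1, x2, y2) instead of quadrilaterals (real implementations need polygon IoU), the 0.3 IoU threshold is an arbitrary assumption, and the function names are illustrative.

```python
def iou(a, b):
    """IoU of axis-aligned boxes (x1, y1, x2, y2); a stand-in for quadrilateral IoU."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def weighted_merge(g, p):
    """Score-weighted average of coordinates; the merged score is the sum."""
    (g_box, g_score), (p_box, p_score) = g, p
    s = g_score + p_score
    box = tuple((g_score * gc + p_score * pc) / s for gc, pc in zip(g_box, p_box))
    return box, s


def standard_nms(dets, thresh):
    """Greedy NMS: keep the highest-scoring boxes, drop those overlapping a kept box."""
    keep = []
    for box, score in sorted(dets, key=lambda d: d[1], reverse=True):
        if all(iou(box, k[0]) <= thresh for k in keep):
            keep.append((box, score))
    return keep


def locality_aware_nms(dets, thresh=0.3):
    """dets: (box, score) pairs in row-major order of the pixels that predicted them."""
    merged, prev = [], None
    for det in dets:
        if prev is not None and iou(det[0], prev[0]) > thresh:
            prev = weighted_merge(det, prev)  # fold into the running box
        else:
            if prev is not None:
                merged.append(prev)
            prev = det
    if prev is not None:
        merged.append(prev)
    # Standard NMS now runs on far fewer, already merged candidates.
    return standard_nms(merged, thresh)


dets = [((0, 0, 10, 10), 0.9), ((1, 0, 11, 10), 0.9), ((50, 50, 60, 60), 0.8)]
print(locality_aware_nms(dets))
```

Here the first two neighbouring boxes are merged into one box with score 1.8 before standard NMS sees them, while the distant third box survives unchanged.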

  • EAST network performance comparison

Compared with other text detection algorithms, EAST achieves better results on datasets such as ICDAR 2015 and MSRA-TD500. The authors also compare the detection performance obtained with different backbone networks and different regression (geometry) types, and the best results are obtained with the RBOX geometry combined with MS (multi-scale testing).

  • EAST detection examples

On natural-scene images containing text, EAST can detect text regions with different orientations and angles, against different backgrounds and environments, and in different fonts.
