Classic Text Detection Models Explained: EAST

Text detection in natural scenes is an important application of deep learning. Earlier articles in this series introduced the deep-learning-based text detection models CTPN and SegLink (see: "Classic Text Detection Models Explained: CTPN" and "Classic Text Detection Models Explained: SegLink"). A typical text detection model is multi-stage: training and detection are split into several steps, for example detecting fragments of text lines first and then merging them into complete lines. This not only hurts detection accuracy but is also time-consuming; generally speaking, the more intermediate steps a text detection pipeline has, the more its results may degrade. So is there a detection model that is both fast and accurate?

 

1. Introduction to the EAST model

The EAST model introduced in this article simplifies these intermediate steps and performs end-to-end text detection directly. The design is elegant and concise, and it improves both detection accuracy and speed. As shown below:

In the figure, (a), (b), (c) and (d) are several common text detection pipelines. A typical pipeline includes stages such as candidate box extraction, candidate box filtering, bounding box regression and candidate box merging, which makes the process rather lengthy. (e) is the detection pipeline of the EAST model introduced in this article: it is reduced to just an FCN stage (fully convolutional network) and an NMS stage (non-maximum suppression), greatly cutting down the intermediate processing, and its output supports multi-angle detection of both text lines and words, so it is efficient, accurate, and adaptable to many natural scenes. (d) is the CTPN model; although its pipeline looks similar to the EAST pipeline in (e), it only supports horizontal text detection, so its applicable scenarios are narrower than EAST's. As shown below:
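To make the two-stage pipeline concrete, here is a minimal Python sketch (not the authors' code) of EAST-style inference, assuming hypothetical callables east_model, decode_geometry and nms; plain NMS stands in for the locality-aware NMS used by EAST:

```python
import numpy as np

def detect_text(image, east_model, decode_geometry, nms,
                score_thresh=0.8, nms_thresh=0.2):
    """Sketch of EAST inference: one FCN forward pass followed by NMS.

    All callables are hypothetical placeholders: `east_model` returns a
    per-pixel score map plus geometry maps, `decode_geometry` turns one
    pixel's geometry prediction into a box, and `nms` stands in for the
    locality-aware NMS described in the paper.
    """
    score_map, geometry = east_model(image)          # FCN stage: dense prediction
    ys, xs = np.nonzero(score_map > score_thresh)    # keep only confident pixels
    boxes = [decode_geometry(geometry, x, y) for y, x in zip(ys, xs)]
    scores = [float(score_map[y, x]) for y, x in zip(ys, xs)]
    return nms(boxes, scores, nms_thresh)            # NMS stage: merge duplicates
```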

 

2. EAST model network structure

The network structure of the EAST model is as follows:

The network structure of the EAST model is divided into three parts: the feature extraction layer, the feature fusion layer, and the output layer.

Each part is introduced below:

1. Feature extraction layer

EAST uses PVANet (an object detection model) as the backbone of the network and extracts feature maps from the convolutional layers of stage1, stage2, stage3 and stage4. The size of the feature map is halved from one stage to the next while the number of convolution kernels doubles, following the idea of the feature pyramid network (FPN). In this way, feature maps of different scales can be extracted to detect text lines of different scales (large feature maps are good at detecting small objects, and small feature maps are good at detecting large objects). This idea is very similar to the SegLink model introduced in the previous article.
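As a rough illustration only (PVANet itself is not reproduced here), the PyTorch sketch below builds a four-stage backbone in which each stage halves the spatial size and doubles the channel count, and collects one feature map per stage for later fusion; the stage depths and channel widths are illustrative assumptions:

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Illustrative 4-stage backbone (a stand-in for PVANet): each stage
    halves the feature-map size and doubles the number of channels."""

    def __init__(self):
        super().__init__()
        channels = [3, 32, 64, 128, 256]                 # illustrative widths
        self.stages = nn.ModuleList()
        for c_in, c_out in zip(channels[:-1], channels[1:]):
            self.stages.append(nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),  # halve H and W
                nn.ReLU(inplace=True),
                nn.Conv2d(c_out, c_out, 3, padding=1),
                nn.ReLU(inplace=True),
            ))

    def forward(self, x):
        features = []
        for stage in self.stages:
            x = stage(x)
            features.append(x)       # keep every stage's feature map
        return features              # [stage1, stage2, stage3, stage4]

# a 512x512 image yields feature maps of sizes 256, 128, 64 and 32
feats = TinyBackbone()(torch.randn(1, 3, 512, 512))
print([tuple(f.shape) for f in feats])
```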

2. Feature fusion layer

The feature maps extracted above are merged according to certain rules. The merging here follows the U-Net approach, with the following rules (a code sketch of this merge branch follows the list):

  • The feature map (f1) from the last stage of the feature extraction layer is first fed into an unpooling layer, which doubles its size (2x upsampling)
  • It is then concatenated with the feature map (f2) of the previous stage
  • 1x1 and 3x3 convolutions are then applied in turn
  • The above process is repeated for f3 and f4, with the number of convolution kernels decreasing layer by layer: 128, 64, 32 in turn
  • Finally, a 3x3 convolution with 32 kernels produces the feature map that is passed to the output layer
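Putting these merge rules into code, here is a minimal PyTorch sketch of the fusion branch. The 128/64/32 kernel schedule follows the list above; the channel widths assumed for the incoming feature maps match the backbone sketch earlier and are otherwise illustrative:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionBranch(nn.Module):
    """U-Net style merging: upsample x2, concatenate with the next-larger
    feature map, then 1x1 and 3x3 convolutions; kernels shrink 128->64->32."""

    def __init__(self, in_channels=(256, 128, 64, 32), out_channels=(128, 64, 32)):
        super().__init__()
        self.conv1x1 = nn.ModuleList()
        self.conv3x3 = nn.ModuleList()
        prev = in_channels[0]                            # channels of f1 (deepest map)
        for skip_c, out_c in zip(in_channels[1:], out_channels):
            self.conv1x1.append(nn.Conv2d(prev + skip_c, out_c, 1))
            self.conv3x3.append(nn.Conv2d(out_c, out_c, 3, padding=1))
            prev = out_c
        self.final = nn.Conv2d(prev, 32, 3, padding=1)   # last 3x3 conv, 32 kernels

    def forward(self, f1, f2, f3, f4):
        x = f1                                           # f1: smallest, deepest feature map
        for skip, c1, c3 in zip((f2, f3, f4), self.conv1x1, self.conv3x3):
            x = F.interpolate(x, scale_factor=2, mode="nearest")  # "unpooling": 2x upsample
            x = torch.cat([x, skip], dim=1)              # concatenate with shallower map
            x = F.relu(c3(F.relu(c1(x))))                # 1x1 conv, then 3x3 conv
        return F.relu(self.final(x))                     # 32-channel map for the output layer
```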

3. Output layer

The output layer finally produces the following information:

  • score map: the confidence of the detection box, 1 parameter;
  • text boxes: the position of the detection box (x, y, w, h), 4 parameters;
  • text rotation angle: the rotation angle of the detection box, 1 parameter;
  • text quadrangle coordinates: the corner coordinates of an arbitrary quadrilateral detection box, (x1, y1), (x2, y2), (x3, y3), (x4, y4), 8 parameters.

The text boxes coordinates and the text quadrangle coordinates may look redundant, but they are not: the quadrangle output is there to handle distorted (non-rectangular) text lines, as shown below:

If only the position and rotation angle of the text box (x, y, w, h, θ) were output, the predicted detection box would be the pink box in the figure above, which deviates from the true position of the text. By additionally outputting the coordinates of an arbitrary quadrilateral, the output layer can predict the position of the detection box (the yellow box) more accurately.
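A minimal sketch of how such an output layer can be written as a set of 1x1 convolutions on the 32-channel fused feature map; the head names are illustrative, and the channel counts follow the 1/4/1/8 split listed above:

```python
import torch
import torch.nn as nn

class EASTOutputLayer(nn.Module):
    """Illustrative output heads: score (1), box geometry (4), rotation
    angle (1) and quadrangle coordinates (8), predicted per pixel."""

    def __init__(self, in_channels=32):
        super().__init__()
        self.score_head = nn.Conv2d(in_channels, 1, 1)   # score map
        self.box_head   = nn.Conv2d(in_channels, 4, 1)   # text boxes (x, y, w, h)
        self.angle_head = nn.Conv2d(in_channels, 1, 1)   # text rotation angle
        self.quad_head  = nn.Conv2d(in_channels, 8, 1)   # 4 corner coordinates

    def forward(self, fused):
        return {
            "score": torch.sigmoid(self.score_head(fused)),
            "boxes": self.box_head(fused),
            "angle": self.angle_head(fused),
            "quad":  self.quad_head(fused),
        }
```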

 

3. EAST model effect

The detection results of EAST are shown in the figure below; it even detects text lines under affine transformation (such as billboards).

The advantages of the EAST model are its concise detection pipeline, high efficiency and accuracy, and support for multi-angle text line detection. It also has shortcomings: (1) it performs relatively poorly on long text, mainly because the receptive field of the network is not large enough; (2) its results on curved text are not ideal either.

 

4. Advanced EAST

To address EAST's weakness on long text, Advanced EAST was proposed. It uses VGG16 as the backbone of the network and likewise consists of three parts: a feature extraction layer, a feature merging layer, and an output layer. Experiments show that Advanced EAST achieves better detection accuracy than EAST, especially on long text.

The network structure is as follows:
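Alongside the structure diagram, here is a minimal sketch of pulling multi-scale feature maps from torchvision's VGG16, as the feature extraction layer of a VGG16-based model would; the split points chosen here are an illustrative assumption, not Advanced EAST's official code:

```python
import torch
import torch.nn as nn
from torchvision import models

class VGG16Features(nn.Module):
    """Collect one feature map per pooling stage of VGG16 so the maps can
    later be merged U-Net style (the split indices are illustrative)."""

    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=None).features   # convolutional part of VGG16
        # each block below ends at one of VGG16's max-pooling layers
        self.blocks = nn.ModuleList([vgg[:10], vgg[10:17], vgg[17:24], vgg[24:31]])

    def forward(self, x):
        feats = []
        for block in self.blocks:
            x = block(x)
            feats.append(x)          # one feature map per scale (1/4, 1/8, 1/16, 1/32)
        return feats

feats = VGG16Features()(torch.randn(1, 3, 256, 256))
print([tuple(f.shape) for f in feats])   # channel counts: 128, 256, 512, 512
```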

 

Strongly Recommended

In 2017, Xinyu Zhou et al. published the classic EAST paper "EAST: An Efficient and Accurate Scene Text Detector", which describes the technical principles of EAST in detail. Reading the paper is recommended for a deeper understanding of the model.

 

Follow my official account "Big Data and Artificial Intelligence Lab" (BigdataAILab) and reply with the keyword "paper" to read classic papers online.

 
