A Plain-Language Introduction to the Classic Text Detection Model: CTPN

Text recognition is an important application of AI, such as recognizing the text on product packaging, in product manuals, or on street billboards. It brings great convenience to everyday life and has a very wide range of applications.

A simple text recognition process is as follows:

Step 1. Collect images containing the characters to be recognized, using mobile phones, cameras, or other devices, as input;

Step 2. Perform preprocessing operations on the image, such as size scaling, shading adjustment, and denoising (a minimal sketch follows this list);

Step 3. Detect single characters in the image, or the regions where several consecutive characters are located;

Step 4. Segment the text regions from the image according to the detection results, feed them into a recognition model, and obtain the character information in the image.
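As an illustration of Step 2 above, here is a minimal preprocessing sketch, assuming OpenCV is available; the file name, target size, and filter parameters are purely illustrative assumptions, not part of the CTPN pipeline itself.

```python
import cv2

# Minimal sketch of Step 2's preprocessing (illustrative values only).
image = cv2.imread("billboard.jpg")               # hypothetical input file
image = cv2.resize(image, (900, 600))             # size scaling (width x height)
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)    # work in grayscale
gray = cv2.equalizeHist(gray)                     # simple shading / contrast adjustment
denoised = cv2.fastNlMeansDenoising(gray, h=10)   # denoising
```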

This process contains two critical stages: "text detection" and "text recognition". This article introduces CTPN, a classic model for "text detection"; a "text recognition" model will be introduced in a later article, so stay tuned.

For printed text, the layout is standardized, so detection and recognition technology is already very mature; WeChat and QQ, which we use every day, can extract text from pictures. For text in natural scenes, the wide variety of lighting conditions and text styles makes detection considerably harder, for example detecting the text on street billboards, as shown below:

This article mainly introduces the classic text detection model CTPN, which can be used to detect text not only in natural scenes but also in printed documents.

1. Characteristics of text distribution

Before diving into text detection, let's look at how text is usually distributed. Whether it is printed text or text in a natural scene, text is generally arranged horizontally: the length of a run of consecutive characters varies, but the height stays basically the same, as shown below:

This is also the basic idea behind CTPN: since the width is variable and uncertain, detection works at a fixed height, looking for continuous regions of the image that share the same height and whose edges match the characteristics of text, and then circling them.

2. What is CTPN

CTPN comes from the paper "Detecting Text in Natural Image with Connectionist Text Proposal Network" (text detection based on a connectionist text proposal network). The model's goal is to accurately locate text lines in an image. Its basic approach is to generate a series of appropriately sized text proposals (pre-selection boxes) directly on the feature map produced by convolution and use them to detect text lines. The following figure illustrates the detection idea (note: CTPN actually generates proposals on the feature map, not on the original image; the figure below is just a schematic):

3. Principle of CTPN technology

The CTPN model combines a CNN and an RNN seamlessly to improve detection accuracy: the CNN extracts deep features, while the RNN models the sequential relationships among them. Specifically:

(1) CNN (using VGG16)

CTPN generates a series of proposals (pre-selection boxes) for detection on the feature map output by the VGG16 convolutional layers. VGG is a classic convolutional neural network; for details, please refer to the earlier article from this official account: Vernacular Convolutional Neural Network (VGGNet).

(2) RNN

Since text is a sequence made up of "characters, parts of characters, and groups of characters", the detection targets are not independent and isolated but related to what comes before and after. CTPN therefore uses an RNN (Recurrent Neural Network) to exploit context when predicting text positions. For an introduction to RNNs, please refer to the earlier article from this official account: Vernacular Recurrent Neural Network (RNN).

The network structure of the CTPN model is shown in the following figure: 

The whole process is mainly divided into six steps:

Step 1: Input an image of size 3×600(h)×900(w), extract features with VGG16, and take the output of conv5_3 (the third convolutional layer of VGG's fifth block) as the feature map, with size 512×38×57;
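A minimal sketch of Step 1, assuming PyTorch and torchvision are available; pretrained weights are skipped here, since only the shapes matter for the illustration.

```python
import torch
import torchvision

# Sketch: extract the conv5_3 feature map with torchvision's VGG16.
vgg16 = torchvision.models.vgg16(weights=None)   # in practice ImageNet weights would be loaded
# The first 30 modules of vgg16.features run up to and including the ReLU after
# conv5_3, stopping before the final max-pool, so the output stride is 16.
backbone = torch.nn.Sequential(*list(vgg16.features.children())[:30])

image = torch.randn(1, 3, 600, 900)              # N x C x H x W, as in Step 1
with torch.no_grad():
    feature_map = backbone(image)
print(feature_map.shape)
# roughly (1, 512, H/16, W/16); the article's 38x57 assumes ceil-style pooling (as in Caffe),
# while torchvision's default floor-style pooling gives (1, 512, 37, 56) for this input
```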

Step 2: Slide a 3×3 window over this feature map, so 512×38×57 becomes 4608×38×57 (the 512 channels are expanded by the 3×3 window: 512×3×3 = 4608);
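One possible way to express Step 2 in PyTorch, assuming the 3×3 sliding window is implemented as an im2col-style gather (torch.nn.functional.unfold):

```python
import torch
import torch.nn.functional as F

# Sketch of Step 2: every spatial position gathers its 3x3 neighbourhood of 512 channels.
feature_map = torch.randn(1, 512, 38, 57)
windows = F.unfold(feature_map, kernel_size=3, padding=1)   # (1, 512*3*3, 38*57)
windows = windows.view(1, 512 * 9, 38, 57)                  # (1, 4608, 38, 57)
print(windows.shape)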

Step 3: Feed the window features of each row into an RNN (a BLSTM, i.e. a bidirectional LSTM), where each LSTM has 128 hidden units: 57×38×4608 becomes 57×38×128 in the forward direction, the reverse LSTM also yields 57×38×128, and concatenating the two gives 256×38×57;
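A minimal sketch of Step 3, treating each of the 38 rows of the feature map as a length-57 sequence and feeding it to a bidirectional LSTM with 128 hidden units per direction (shapes only, illustrative):

```python
import torch

windows = torch.randn(1, 4608, 38, 57)                    # output of Step 2
N, C, H, W = windows.shape
seq = windows.permute(0, 2, 3, 1).reshape(N * H, W, C)    # 38 sequences, each of length 57
blstm = torch.nn.LSTM(input_size=4608, hidden_size=128,
                      batch_first=True, bidirectional=True)
out, _ = blstm(seq)                                        # (38, 57, 256): forward + reverse concatenated
out = out.reshape(N, H, W, 2 * 128).permute(0, 3, 1, 2)    # back to (1, 256, 38, 57)
print(out.shape)
```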

Step 4: Feed the RNN output into an FC (fully connected) layer whose weight matrix is 256×512, yielding 512×38×57;

Step 5: Feed the FC-layer features into three classification/regression heads. The first head (2k vertical coordinates) and the third head (k side-refinements) regress the position information of the k anchors (an anchor can be understood simply as a small rectangular box that locates the characters, the small red elongated box in the schematic above; its width is fixed, 16 pixels by default), while the second head (2k scores) gives the category of each of the k anchors (text or non-text);
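A rough sketch of Steps 4 and 5 together, assuming k = 10 anchors per position (the value used in the paper) and implementing the "FC" layer as a 1×1 convolution applied at every spatial position, which is equivalent to a shared 256×512 matrix:

```python
import torch

k = 10                                          # anchors per position (the paper's default)
rnn_out = torch.randn(1, 256, 38, 57)           # output of Step 3

fc = torch.nn.Conv2d(256, 512, kernel_size=1)   # shared fully connected layer (256 -> 512)
vertical = torch.nn.Conv2d(512, 2 * k, 1)       # 2k vertical coordinates (centre-y, height per anchor)
scores = torch.nn.Conv2d(512, 2 * k, 1)         # 2k scores (text / non-text per anchor)
side_ref = torch.nn.Conv2d(512, k, 1)           # k side-refinement offsets

x = fc(rnn_out)
print(vertical(x).shape, scores(x).shape, side_ref(x).shape)
# (1, 20, 38, 57)  (1, 20, 38, 57)  (1, 10, 38, 57)
```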

Step 6: Use a text-line construction algorithm to merge the resulting slender rectangular boxes into text-line boxes. Its main idea is that every two similar candidate regions form a pair, and pairs are merged repeatedly until no further merging is possible.
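The real algorithm builds pairs of neighbouring proposals and merges them; the following is only a simplified greedy sketch of the same idea, with an assumed box format of (x_min, y_min, x_max, y_max) and illustrative thresholds:

```python
def vertical_overlap(a, b):
    """Overlap of the vertical extents of two boxes, relative to the shorter one."""
    inter = max(0, min(a[3], b[3]) - max(a[1], b[1]))
    return inter / min(a[3] - a[1], b[3] - b[1])

def build_text_lines(boxes, max_gap=50, min_overlap=0.7):
    """Greedily chain fixed-width proposals that are close horizontally and aligned vertically."""
    boxes = sorted(boxes, key=lambda b: b[0])
    lines, current = [], [boxes[0]]
    for box in boxes[1:]:
        prev = current[-1]
        if box[0] - prev[2] <= max_gap and vertical_overlap(prev, box) >= min_overlap:
            current.append(box)       # similar neighbour: extend the current line
        else:
            lines.append(current)     # cannot merge any further: start a new line
            current = [box]
    lines.append(current)
    # a text line's box is the union of its member proposals
    return [(min(b[0] for b in l), min(b[1] for b in l),
             max(b[2] for b in l), max(b[3] for b in l)) for l in lines]

print(build_text_lines([(0, 10, 16, 40), (16, 11, 32, 41), (300, 10, 316, 38)]))
# -> [(0, 10, 32, 41), (300, 10, 316, 38)]
```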

That concludes the introduction to the main principles of CTPN. Applying the CTPN model to detect text in natural scenes gives results like those shown in the following figure:

4. Summary

In summary, the biggest highlight of CTPN is the introduction of an RNN into detection: a CNN first extracts deep features; fixed-width anchors (slender rectangular boxes of fixed width) are then used to detect text regions; the features corresponding to anchors in the same row are strung into sequences and fed into the RNN; a fully connected layer performs classification and regression; and finally the small candidate boxes are merged to obtain the complete region where the text lies. This seamless combination of RNN and CNN effectively improves detection accuracy.

 

Strongly Recommended

In 2016, Zhi Tian et al. published the classic CTPN paper "Detecting Text in Natural Image with Connectionist Text Proposal Network", which introduces the ideas and technical principles of CTPN in detail. It is recommended to read the paper to learn more about the model.

Follow my official account "Big Data and Artificial Intelligence Lab" (BigdataAILab) and reply with the keyword "paper" to read the classic papers online.

 
