Text Recognition OCR Analysis: Features

A brief exploration of OCR technology: feature extraction (1)

Research Background

Optical character recognition (OCR) refers to converting the text in an image into computer-editable text. Researchers have been studying the related technologies for a long time, and there are many mature OCR technologies and products, such as Hanwang OCR, ABBYY FineReader, and Tesseract OCR. It is worth mentioning that ABBYY FineReader not only achieves high accuracy (including on Chinese), but also preserves most of the original layout, making it a very powerful commercial OCR product.

However, among these finished OCR products, all except Tesseract OCR are closed-source or even commercial software, which we can neither embed into our own programs nor improve upon. The only open-source option is Google's Tesseract OCR, but its recognition results are mediocre and its accuracy on Chinese is low, so it needs further improvement.

To sum up, whether for academic research or practical application, it is worthwhile to explore and improve OCR technology. Our team divides a complete OCR system into four parts ("feature extraction", "text positioning", "optical recognition", and "language model"), solves them step by step, and finally builds a usable, complete OCR system for printed text. The system can initially be used for image text recognition on e-commerce, WeChat, and other platforms, to judge the authenticity of the information in those images.

Research Hypotheses

In this paper, we assume that the text part of the image has the following characteristics:

1. Assume that the image fonts we want to identify are relatively standard printed fonts, such as Song, Hei, Kai, and running script;

2. There should be a clear contrast between the text and the background;

3. When designing the model, we assumed that the image and text are horizontally formatted;

4. The strokes of the text should have a certain width, not too thin;

5. The color within the same piece of text should at most vary as a gradual gradient;

6. Characters are generally formed by relatively dense strokes and often form connected regions.

It can be seen that these features are shared by typical e-commerce posters and similar images, so the above assumptions are reasonable.

Analysis Process


Figure 1: Flowchart of our experiment

 

Feature Extraction

As the first step of the OCR system, feature extraction finds the features of candidate text regions in the image, so that the text can be located in the second step and recognized in the third. In this part, we focus on imitating how the human eye processes images and Chinese characters, and take an innovative approach to image processing and Chinese character localization. This part of the work is the core of the entire OCR system, and it is also the core of our work.

 

Most traditional text segmentation approaches follow "edge detection + erosion/dilation + connected-region detection", as in paper [1]. However, performing edge detection on images with complex backgrounds produces too many edges in the background (that is, increased noise), while edge information in the text part is easily lost, resulting in poor performance. If erosion or dilation is then performed, background regions stick to text regions and the result deteriorates further. (In fact, we went quite far down this road and even wrote our own edge detection function, but after a lot of testing we finally decided to drop this line of thinking.)

 

Therefore, in this paper we give up edge detection and erosion/dilation, and instead obtain relatively good features of the text regions through clustering, segmentation, denoising, pooling and other steps. The whole process is roughly as shown in Figure 2. These features can even be fed directly into the text recognition model without additional processing. Since each part of our result is supported by a corresponding theoretical basis, the reliability of the model can be guaranteed.

Figure 2: Approximate process of feature extraction

 

In this part of the experiment, we demonstrate the effect with Figure 3. This image is characterized by medium size, a busy background, rich colors, mixed typesetting of text and pictures, and a non-fixed layout; it is a typical e-commerce promotional image. The main difficulty in processing this picture is how to distinguish the picture area from the text area, that is, to identify and remove the rice cooker on the right and keep only the text regions.

Figure 3: Introduction of Xiaomi Rice Cooker

 

Image Preprocessing

First, we read the original image as a grayscale image, obtaining an m×n grayscale matrix M, where m and n are the height and width of the image. Reading the image this way has a lower dimensionality than reading the RGB color image directly, and there is no significant loss of text information. Converting to grayscale simply combines the three channels R, G, B of the original image into one channel via the standard luminance formula

$\mathrm{Gray} = 0.299R + 0.587G + 0.114B$    (1)
The grayscale image of Figure 3 is as follows

The size of the image itself is not large. If it is processed directly, the strokes of the text will be too thin and easily treated as noise. Therefore, to ensure that the strokes have a certain thickness, the image can be enlarged first. In our experiments, enlarging the image to twice its original size generally works well.

 

However, when the image is enlarged, the distinction between the text and the background is reduced, because an interpolation algorithm is used to fill in the missing pixels. The degree of discrimination therefore needs to be increased accordingly. After testing, a power transformation with exponent 2 works well on most pictures. The power transformation is

$x \mapsto x^r$    (2)

where x denotes an element of the matrix M and r is the exponent; here we choose r = 2. The result then needs to be mapped back to the interval [0, 255]:

$x \mapsto 255 \times \dfrac{x - x_{\min}}{x_{\max} - x_{\min}}$    (3)

where $x_{\max}$ and $x_{\min}$ are the maximum and minimum values of the transformed matrix.
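For illustration, the whole preprocessing step can be sketched in a few lines of Python. This is only a minimal sketch under the assumptions above (Pillow and NumPy; the function name, the choice of bilinear interpolation for enlargement, and the default arguments are illustrative, not necessarily what our experiments used):

```python
import numpy as np
from PIL import Image

def preprocess(path, scale=2, r=2):
    """Grayscale conversion (1), 2x enlargement, power transform (2) and rescaling (3)."""
    im = Image.open(path).convert('L')                       # read as grayscale (luminance formula (1))
    im = im.resize((im.width * scale, im.height * scale),
                   Image.BILINEAR)                           # enlarge to twice the original size
    M = np.asarray(im, dtype=np.float64)
    M = M ** r                                               # power transformation with exponent r = 2
    M = 255 * (M - M.min()) / (M.max() - M.min())            # map back to [0, 255], formula (3)
    return M
```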

 

Grayscale Clustering

Then we cluster the gray levels of the image. There are two factual bases for clustering:

 

  1. Grayscale resolution    The grayscale resolution of the naked eye is only about 40 levels, so pixel values such as 254 and 255 both look simply white to us;

  2. Design principles    According to general aesthetic principles for poster design, clothing matching, and so on, the color scheme of a poster or outfit should usually not exceed three colors.

 

More generally speaking, although gray levels range over [0, 255], the number of distinct tones we actually perceive is usually small. Therefore, similar gray levels can be grouped into one class, which reduces the number of gray levels and effectively reduces noise.

 

In fact, this clustering is a process of adaptive multi-level thresholding according to the characteristics of the image, which avoids the information loss caused by traditional simple binarization. Since we need to determine the number of clusters automatically, traditional clustering methods such as KMeans were abandoned, and in our tests feasible alternatives such as MeanShift were too slow. Therefore, we designed the clustering method ourselves, using the idea of kernel density estimation and clustering by finding the extrema of the gray-level density.

 

Kernel density estimation    For the preprocessed image, we can count the number of occurrences of each gray level and obtain the frequency distribution histogram shown in Figure 5:


Figure 5: Grayscale statistics on preprocessed images

 

It can be seen that the distribution of gray levels forms several prominent peaks; in other words, there is a certain clustering trend. However, the histogram is discontinuous, and a smooth result is more convenient for analysis and more convincing. One method of smoothing statistical results is kernel density estimation.

 

The kernel density estimation method is a non-parametric estimation method proposed by Rosenblatt and Parzen, and it is highly valued in both statistical theory and applications [2]. It can also simply be regarded as a way of smoothing a function. When we estimate the probability density at a value x from a large number of samples $x_1, x_2, \dots, x_n$, what we actually compute is

$\hat{f}(x) = \dfrac{1}{nh}\sum_{i=1}^{n} K\!\left(\dfrac{x - x_i}{h}\right)$    (4)

where K(x) is called the kernel function. When h is taken as 1 and K(x) is taken as

$K(x) = \begin{cases} 1, & |x| \leq 1/2 \\ 0, & |x| > 1/2 \end{cases}$    (5)

this is exactly the histogram estimate above. The meaning of K(x) is very simple: everything within a distance h of x is counted towards x, and how it is counted is given by K. It can be seen that the choice of h, called the bandwidth, has a great influence on the result; it mainly controls the smoothness. If K(x) is discontinuous, the result is still discontinuous, but if K(x) is smooth, the result is relatively smooth. A commonly used smooth kernel is the Gaussian kernel

$K(x) = \dfrac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$    (6)

and the resulting estimate is called a Gaussian kernel density estimate. Here we use Scott's rule to choose the bandwidth adaptively, but a smoothing factor still needs to be specified manually; in this paper we choose 0.2. For the example image, we obtain the result shown as the red curve in Figure 6.

 


Figure 6: Gaussian Kernel Density Estimation of Frequency Distribution
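For illustration, the Gaussian kernel density estimate of the gray-level histogram can be computed directly with NumPy. This is only a minimal sketch, not necessarily the exact implementation used in our experiments; the function name and the way the 0.2 smoothing factor scales Scott's bandwidth are illustrative choices:

```python
import numpy as np

def grey_level_density(M, factor=0.2):
    """Gaussian kernel density estimate (formulas (4) and (6)) over gray levels 0..255."""
    x = M.ravel()
    n = x.size
    counts = np.bincount(np.clip(x, 0, 255).astype(np.int64),
                         minlength=256)                       # frequency of each gray level
    h = factor * x.std() * n ** (-1.0 / 5)                    # Scott's rule bandwidth, scaled by 0.2
    t = np.arange(256, dtype=np.float64)
    u = (t[:, None] - t[None, :]) / h                         # pairwise distances between gray levels
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)            # Gaussian kernel, formula (6)
    return t, K.dot(counts) / (n * h)                         # density at each gray level, formula (4)
```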

 

Maximum-Minimum Split

 

From Figure 6, we can further see that there is indeed a clustering trend in the image: the density curve has several distinct maxima and minima, with maxima at x = 10, 57, 97, 123, 154 and minima at x = 25, 71, 121, 142.

 

Therefore, a natural clustering method is to form as many classes as there are maximum points and to use the minimum points as the boundaries between classes. In other words, for Figure 3 the image can be split into 5 layers and processed layer by layer. After layering, the shape of each layer is as shown below, where white is 1 and black is 0.

 

                       

The image divided into 5 layers by clustering

 

It can be seen that, thanks to the "contrast" and "gradient" assumptions, the text layers can indeed be separated by this clustering method based on kernel density estimation. Moreover, with this idea of clustering into layers, no assumption needs to be made about the text color, and even when the text color is close to the background color, effective detection can still be obtained.
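Continuing from the density estimate sketched above, the layering itself only requires locating the local minima of the density curve and thresholding the gray levels between them. A minimal sketch (the function name is illustrative; `f` is the density output of the earlier sketch):

```python
import numpy as np

def split_layers(M, f):
    """Split the image into layers bounded by the local minima of the density curve f."""
    interior = np.arange(1, 255)
    minima = interior[(f[1:-1] < f[:-2]) & (f[1:-1] < f[2:])]  # local minima of the density
    bounds = np.concatenate(([-1.0], minima, [255.0]))         # class boundaries on the gray axis
    return [((M > lo) & (M <= hi)).astype(np.uint8)            # one binary layer per gray-level class
            for lo, hi in zip(bounds[:-1], bounds[1:])]
```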

 

Layer-by-Layer Recognition

Once the image has been effectively layered, we can design the corresponding model according to the previous assumptions and find the text regions in the image by processing it layer by layer.

 

Connectivity

Each layer of the image is composed of several connected regions, and since text itself is composed of relatively dense strokes, characters often form connected regions. Connectivity here is defined by 8-adjacency: the 8 pixels surrounding a pixel are defined as its neighbours, and adjacent non-zero pixels belong to the same connected region.

 

After defining the connected regions, each layer is divided into several connected regions, that is, we gradually decompose the original image, as shown in Figure 9.

Figure 9: Image decomposition structure diagram
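The connected-region decomposition can be done with standard tools. A minimal sketch using `scipy.ndimage.label` with an all-ones 3x3 structuring element for 8-adjacency (the helper name is illustrative):

```python
import numpy as np
from scipy import ndimage

EIGHT = np.ones((3, 3), dtype=int)   # 8-adjacency: all 8 surrounding pixels count as neighbours

def connected_regions(layer):
    """Decompose a binary layer into its 8-connected regions, one binary mask per region."""
    labels, num = ndimage.label(layer, structure=EIGHT)
    return [(labels == i).astype(np.uint8) for i in range(1, num + 1)]
```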

 

Erosion resistance    Once the image is broken down to the granularity of connected regions, we no longer subdivide it; the next step is to identify which regions may be text regions. Here we require text regions to have a certain resistance to erosion, so let us first define erosion.

 

Erosion is a morphological transformation of an image, generally applied to binary images. For a non-zero pixel (a pixel with value 1) in a binary image, if all of its neighbouring pixels are 1 it remains unchanged, otherwise it becomes 0; here we again use the 8-adjacency definition. It can be seen that the longer the boundary of a connected region, the greater the "damage" done by the erosion operation; conversely, the shorter the boundary, the smaller the damage.

 

Based on this definition of erosion, we can state a requirement for text regions:

 

Erosion resistance requirement    The connected region where text is located should have a certain resistance to erosion.

 

The "certain" here refers to a continuous range, neither too large nor too small. For example, a square area with a large area has strong anti-corrosion ability because its boundary line is very short, but these areas are obviously not text areas. It belongs to this type; in addition, the anti-corrosion ability is too weak, such as slender lines, which may disappear after corrosion, and these are not used as candidate text areas. Text borders fall into this category.

 

Here we define an index of erosion resistance:

Erosion resistance of a connected region = total area of the region after erosion / total area of the region before erosion    (7)

 

After testing, the erosion resistance of text regions generally lies in the interval [0.1, 0.9].
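A minimal sketch of this screening step, applying one binary erosion with the same 8-adjacency structuring element (the function names and the single erosion step are our reading of the description above, not necessarily the exact implementation):

```python
import numpy as np
from scipy import ndimage

def erosion_resistance(region):
    """Formula (7): area of the region after erosion / area of the region before erosion."""
    eroded = ndimage.binary_erosion(region, structure=np.ones((3, 3)))
    return eroded.sum() / max(region.sum(), 1)

def keep_text_candidates(regions, lo=0.1, hi=0.9):
    """Keep only connected regions whose erosion resistance lies in [lo, hi]."""
    return [r for r in regions if lo <= erosion_resistance(r) <= hi]
```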

 

After screening the connected regions of the 5 decomposed layers by erosion resistance, the feature layers shown in the figure below are obtained.

 

Only the connected regions with erosion resistance in the interval [0.1, 0.9] are retained

 

Pooling operation    So far we have obtained 5 feature layers; for this example, the naked eye can see that the text is concentrated mainly in the 5th feature layer. For general pictures, however, the text may be distributed over several feature layers, so the feature layers need to be integrated. Our method of feature integration is similar to "pooling" in convolutional neural networks, so we borrow that name. First, we superimpose the 5 feature layers to obtain an overall image feature (called the stacked feature). Such a feature could be used as the final output, but it is not the best choice: we believe that the main text features of a given area should be concentrated in a single feature layer rather than scattered over all feature layers. Therefore, after obtaining the stacked feature, we integrate the features with a method similar to "max pooling", in the following two steps (a small code sketch follows the list):

1. Superimpose the feature layers directly, then divide the stacked feature into connected regions;

2. For each connected region, detect which feature layer contributes most to it, and keep only the pixels coming from that feature layer.
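A minimal sketch of this max-pooling-like integration (assuming each feature layer is a binary mask of the same shape; the function name is illustrative):

```python
import numpy as np
from scipy import ndimage

def pool_layers(feature_layers):
    """Within each connected region of the stacked feature, keep only the pixels of
    the feature layer that contributes the most to that region."""
    stack = np.clip(np.sum(feature_layers, axis=0), 0, 1)        # step 1: stacked feature
    labels, num = ndimage.label(stack, structure=np.ones((3, 3)))
    pooled = np.zeros_like(stack)
    for i in range(1, num + 1):
        mask = (labels == i)
        best = int(np.argmax([(layer * mask).sum()               # step 2: main contributing layer
                              for layer in feature_layers]))
        pooled[mask] = feature_layers[best][mask]
    return pooled
```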

 

After this pooling operation, the final feature result is shown in Figure 11.

 

Figure 11: Features after pooling

 

Post-processing

For the image in our demonstration, after the above operations the feature map in Figure 11 needs no further processing. For general pictures, however, there may still be some unwanted regions left, which need to be further excluded based on the preceding results. The exclusion process has two main steps: the exclusion of low/high density regions, and the exclusion of isolated regions.

 

Density exclusion    One kind of connected region that is obviously not a text region is a low-density region. A typical example is a connected region composed of table lines: such a region covers a large range but contains few points, that is, its density is very low, so low-density regions can be excluded. First we define the connected-region density and the low-density region:

 

Connected-region density    Starting from a connected region, find its horizontally circumscribed rectangle; the density of the region is defined as

Connected-region density = (area of the connected region / area of the circumscribed rectangle) × (total area of the original image / area of the circumscribed rectangle)    (8)

 

Low-density region    If the density of a connected region is less than 16, it is defined as a low-density region.

 

The intuitive definition would simply be (area of the connected region / area of the circumscribed rectangle), but here we multiply by the factor (total area of the original image / area of the circumscribed rectangle) in order to take the size of the region into account: since text generally has clear boundaries and is easily segmented, the larger a region is, the less likely it is to be a text region. The parameter 16 is an empirical value. Low-density exclusion is an effective way to remove non-text regions with many lines, such as tables. Similarly, large high-density regions also need to be excluded. Once low-density regions are defined, it is easy to define high-density regions:

 

High-density region (preliminary definition)    If the region obtained by inverting a connected region within its horizontally circumscribed rectangle is a low-density region, then the connected region is defined as a high-density region.

 

This definition is natural, but it has a flaw. For example, the character "一" (one) is itself almost a horizontal rectangle, so the density after inversion is 0 and the character "一" would be excluded, which is unreasonable. One solution to this problem is:

High-density region    A connected region is defined as a high-density region if and only if the following condition holds:

(area of the circumscribed rectangle − area of the connected region + 1) / area of the circumscribed rectangle × total area of the original image / area of the circumscribed rectangle < 16    (9)

 

That is, compared with the original definition, 1 is added to the inverted area to prevent the density from being 0 after inversion.

 

There is another failure case: if the input image is a single-character image, there is only one connected region, and the ratio of the circumscribed rectangle's area to the total area of the original image is close to 1, so the region is judged to be low density, which excludes the single character. This situation is genuinely hard to balance. A possible solution is to let the user specify whether the input is in single-character mode, single-line mode, or whole-image mode; Google's Tesseract OCR also provides such an option.
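A minimal sketch of the density computation and the low-density exclusion (formula (8) with the empirical threshold 16; the high-density and single-character cases discussed above are omitted, and the function names are illustrative):

```python
import numpy as np

def region_density(region, total_area):
    """Connected-region density, formula (8)."""
    ys, xs = np.nonzero(region)
    rect_area = (ys.max() - ys.min() + 1) * (xs.max() - xs.min() + 1)  # circumscribed rectangle
    return (region.sum() / rect_area) * (total_area / rect_area)

def drop_low_density(regions, image_shape, threshold=16):
    """Exclude connected regions whose density falls below the empirical threshold."""
    total_area = image_shape[0] * image_shape[1]
    return [r for r in regions if region_density(r, total_area) >= threshold]
```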

 

Isolated Area Exclusion

The starting point of isolated-region exclusion is that characters and strokes should be relatively compact: if a region is obviously isolated from all other regions, it is probably not a text region, so isolated regions can be excluded. First we define the concept of an isolated region:

 

Isolated region    Starting from a connected region, find its horizontally circumscribed rectangle and expand it symmetrically about its centre to 9 times its original area (length and width each become 3 times the original). If the expanded rectangle contains no other connected region, the original connected region is called an isolated region.

 

In most cases, isolated-region exclusion is a very simple and effective denoising method, because many noise points are isolated. However, it carries a certain risk: if an image contains only one character, which forms the only connected region, then that region is isolated and the character is excluded. Therefore, more constraints can be imposed on isolated regions. An optional additional constraint is to exclude an isolated region only if its fill ratio (area of the connected region / area of its circumscribed rectangle) is greater than 0.75 (this value comes from the ratio π/4 of the areas of a circle and its circumscribed square).
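A minimal sketch of isolated-region exclusion with the optional fill-ratio constraint (assumptions: each region is a binary mask of the full image; the function names and the exact slicing of the expanded rectangle are illustrative):

```python
import numpy as np

def bounding_box(region):
    ys, xs = np.nonzero(region)
    return ys.min(), ys.max(), xs.min(), xs.max()

def is_isolated(region, others):
    """True if the bounding box, expanded symmetrically to 3x its height and width,
    contains no pixel of any other connected region."""
    y0, y1, x0, x1 = bounding_box(region)
    h, w = y1 - y0 + 1, x1 - x0 + 1
    Y0, Y1 = max(0, y0 - h), min(region.shape[0], y1 + h + 1)
    X0, X1 = max(0, x0 - w), min(region.shape[1], x1 + w + 1)
    return all(other[Y0:Y1, X0:X1].sum() == 0 for other in others)

def drop_isolated(regions, fill_threshold=0.75):
    """Exclude a region only if it is isolated AND compact enough (fill ratio > 0.75)."""
    kept = []
    for i, r in enumerate(regions):
        y0, y1, x0, x1 = bounding_box(r)
        fill = r.sum() / ((y1 - y0 + 1) * (x1 - x0 + 1))
        others = regions[:i] + regions[i + 1:]
        if is_isolated(r, others) and fill > fill_threshold:
            continue                                   # treated as isolated noise
        kept.append(r)
    return kept
```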
