[Reprinted] Introduction to OCR

Reprinted for learning from: https://www.jianshu.com/p/921c1da740b5

OCR (Optical Character Recognition) refers to the process by which electronic devices (such as scanners or digital cameras) examine characters printed on paper, determine their shapes by detecting patterns of dark and light, and then translate those shapes into computer text using character recognition methods.

Overall, OCR generally consists of two major steps: image processing and text recognition.

I. Image Processing

Before recognizing text, we need to pre-process the original image for subsequent feature extraction and learning. This process usually includes the following sub-steps:

grayscale conversion, binarization, noise reduction, tilt correction, character segmentation, and so on.

Each step involves different algorithms.

We will use the original image below as an example to walk through each step.

Original Image

 

1. Grayscale

Gray processing. In the RGB model, if R = G = B, the color represents a shade of gray, and the value of R = G = B is called the gray value. Therefore, each pixel of a grayscale image needs only one byte to store its gray value (also called the intensity or brightness value), which ranges from 0 to 255.

Simply put, graying turns a color image into a black-and-white one.

Grayscale

There are generally four methods for graying a color image: the component method, the maximum value method, the average method, and the weighted average method.
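As a minimal illustration, the weighted average method (the most commonly used of the four) can be sketched in a few lines of Python. The helper name `to_grayscale` is made up for this example; the 0.299/0.587/0.114 luma weights are the common ITU-R BT.601 values:

```python
def to_grayscale(pixels):
    """Weighted-average graying: one gray byte per (R, G, B) pixel.

    Uses the common luma weights 0.299 R + 0.587 G + 0.114 B,
    so the result stays in the 0-255 range."""
    return [round(0.299 * r + 0.587 * g + 0.114 * b) for r, g, b in pixels]

# A pure-red, a pure-white, and a pure-black pixel:
print(to_grayscale([(255, 0, 0), (255, 255, 255), (0, 0, 0)]))  # → [76, 255, 0]
```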

 

2. Binarization

An image contains the target object, the background, and noise. The most common way to extract the target object directly from a multi-valued digital image is to set a threshold T and use it to divide the image data into two groups: pixels greater than T and pixels less than T.

This special kind of grayscale transformation is called image binarization.

A binarized black-and-white image contains no grays, only pure white and pure black.

Binarization

The most important part of binarization is the choice of threshold, and methods generally fall into fixed-threshold and adaptive-threshold approaches. Commonly used binarization methods include the bimodal method, the P-parameter method, the iterative method, and the OTSU method.
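As a sketch of how the OTSU method picks a threshold, here is a minimal pure-Python version (function names are illustrative, not a library API): it tries every threshold T and keeps the one that maximizes the between-class variance of the two pixel groups, then maps every pixel to pure black or pure white.

```python
def otsu_threshold(gray):
    """Return the threshold that maximises between-class variance."""
    hist = [0] * 256
    for v in gray:
        hist[v] += 1
    total = len(gray)
    sum_all = sum(i * h for i, h in enumerate(hist))
    sum_b = w_b = 0
    best_t, best_var = 0, -1.0
    for t in range(256):
        w_b += hist[t]               # weight of the "background" class
        if w_b == 0:
            continue
        w_f = total - w_b            # weight of the "foreground" class
        if w_f == 0:
            break
        sum_b += t * hist[t]
        m_b = sum_b / w_b            # mean of each class
        m_f = (sum_all - sum_b) / w_f
        var = w_b * w_f * (m_b - m_f) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

def binarize(gray, t):
    """Pixels above t become pure white (255); the rest pure black (0)."""
    return [255 if v > t else 0 for v in gray]
```

On a bimodal image (e.g. dark text on a light page) the chosen T lands between the two histogram peaks, which is exactly the bimodal intuition mentioned above.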

 

3. Image noise reduction

Real-world digital images are often affected by interference from imaging equipment and external environmental noise during digitization and transmission; such images are called noisy images.

The process of reducing noise in a digital image is called image denoising ( Image Denoising ).

Image noise has many sources: it can be introduced during acquisition, transmission, and compression, and it comes in different types, such as salt-and-pepper noise and Gaussian noise. Different kinds of noise call for different processing algorithms.

In the image obtained in the previous step, you can see many scattered small black dots. This is noise in the image, and it would seriously interfere with the program's segmentation and recognition, so noise reduction is needed. Noise reduction is very important at this stage, and the quality of the denoising algorithm has a great influence on feature extraction.

Image noise reduction

Common image denoising methods include the mean filter, the adaptive Wiener filter, the median filter, morphological noise filters, and wavelet denoising.
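For instance, the median filter is especially good at removing the isolated black specks (salt-and-pepper noise) mentioned above. A minimal pure-Python sketch of a 3x3 median filter follows; it is a toy version of what library routines such as OpenCV's `cv2.medianBlur` provide:

```python
def median_filter(img):
    """Apply a 3x3 median filter to a 2D grayscale image (list of lists).

    Border pixels are copied through unchanged; each interior pixel is
    replaced by the median of its 3x3 neighbourhood, which removes
    isolated salt-and-pepper specks without blurring edges much."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = sorted(img[y + dy][x + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
            out[y][x] = window[4]   # median of the 9 values
    return out
```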

 

4. Tilt correction

For the user, it is impossible to hold the camera perfectly level when taking a picture. Therefore, the program needs to rotate the image to find the position considered most horizontal, so that the segmented image can achieve the best possible result.

The most commonly used method of tilt correction is the Hough transform. The idea is to dilate the image so that intermittent characters merge into straight lines, making line detection easier. Once the angle of the line has been calculated, a rotation algorithm can correct the skewed image back to a horizontal position.
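The voting idea behind the Hough transform can be sketched in pure Python (a simplified illustration, not a production deskew routine; `estimate_skew` is a made-up name): each foreground point votes for the lines that could pass through it at each candidate angle, and the angle whose strongest line collects the most votes is taken as the skew angle.

```python
import math

def estimate_skew(points, angles=None):
    """Estimate the dominant text angle (in degrees) from foreground
    pixel coordinates with a simplified Hough transform.

    For each candidate angle a, every point (x, y) votes for the
    quantised line offset rho = y*cos(a) - x*sin(a); points lying on
    one straight line at angle a share the same rho. The angle whose
    best-supported line gets the most votes wins."""
    if angles is None:
        angles = [math.radians(d) for d in range(-45, 46)]
    best_angle, best_votes = 0.0, -1
    for a in angles:
        votes = {}
        for x, y in points:
            rho = round(y * math.cos(a) - x * math.sin(a))
            votes[rho] = votes.get(rho, 0) + 1
        peak = max(votes.values())
        if peak > best_votes:
            best_votes, best_angle = peak, a
    return math.degrees(best_angle)
```

The returned angle can then be fed to any rotation routine to level the image.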

 

5. Character segmentation

For multi-line text, text segmentation includes two steps: line segmentation and character segmentation. Skew correction is a prerequisite for text segmentation.

We project the skew-corrected text onto the Y axis and accumulate all the values, which gives us a histogram along the Y axis.

Projection histogram of the picture on the Y axis

 

The troughs of the histogram correspond to the background, and the peaks correspond to the regions where the foreground (text) lies. In this way we can locate the position of each line of text.

Line segmentation

 

Character segmentation is similar to line segmentation, except that this time we project each line of text onto the X axis.

Note, however, that two adjacent characters in the same line are often very close together and sometimes overlap in the vertical direction; when projected they look like one character, which causes errors during segmentation (this mostly happens with English characters). Conversely, the left and right components of a single character sometimes leave a small gap in the X-axis projection, so one character is mistakenly split into two (this mostly happens with Chinese characters).

Therefore, compared to line segmentation, character segmentation is more difficult.

To handle this, we can preset an expected character width. If a segmented "character" projects much wider than the expected value, it is treated as two characters; if it is much narrower, the gap is ignored and the "characters" on either side of the gap are merged into one character for recognition.
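The projection-and-splitting logic described above can be sketched in pure Python (helper names are illustrative; foreground pixels are assumed to be 1 after binarization):

```python
def project(binary, axis):
    """axis=1: row sums (Y projection, for line segmentation);
    axis=0: column sums (X projection, for character segmentation)."""
    if axis == 1:
        return [sum(row) for row in binary]
    return [sum(col) for col in zip(*binary)]

def split_runs(profile):
    """Return (start, end) index pairs of the nonzero runs of a
    projection profile -- the candidate lines or characters."""
    runs, start = [], None
    for i, v in enumerate(profile):
        if v > 0 and start is None:
            start = i
        elif v == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs

def merge_narrow(runs, expected, small_gap=1):
    """Heuristic from the text: a run much narrower than the expected
    character width, separated from its left neighbour by only a tiny
    gap, is merged back into that neighbour."""
    merged = []
    for s, e in runs:
        if merged and (e - s) < expected // 2 and (s - merged[-1][1]) <= small_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged

# Two text "lines" in a tiny 1/0 bitmap:
page = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
]
print(split_runs(project(page, axis=1)))  # → [(0, 2), (3, 4)]
```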

Character segmentation

 

 

II. Text Recognition

After preprocessing is complete, we come to the text recognition stage. This stage involves some artificial intelligence knowledge, which is relatively abstract and hard to illustrate with pictures.

 

1. Feature extraction and dimensionality reduction

Features are the key information used to recognize text; each character can be distinguished from the others by its features. For digits and English letters, feature extraction is relatively easy: there are only 10 digits and 26 × 2 = 52 letters, 62 characters in all, a small character set. For Chinese characters, feature extraction is much harder: first, Chinese is a large character set; second, the national standard defines 3,755 most commonly used first-level Chinese characters; finally, Chinese characters have complex structures, many similar-looking glyphs, and relatively high feature dimensionality.

After determining which features to use, feature dimensionality reduction may be performed. If the feature dimensionality is too high, the classifier's efficiency suffers greatly, so dimensionality reduction is often needed to improve the recognition rate. This process is also very important: it must not only reduce the feature dimension, but also ensure that the reduced feature vector retains enough information to distinguish between different characters.
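As one small illustration of this step, the direction of greatest variance (the first principal component) of a set of feature vectors can be found with power iteration on the covariance matrix. This is a minimal sketch with made-up helper names; real OCR systems typically use full PCA or LDA from a numerical library:

```python
def first_component(data, iters=50):
    """Power iteration for the top principal component of `data`,
    a list of equal-length feature vectors."""
    n, d = len(data), len(data[0])
    means = [sum(col) / n for col in zip(*data)]
    centered = [[x - m for x, m in zip(row, means)] for row in data]
    # d x d covariance matrix of the centered features
    cov = [[sum(row[i] * row[j] for row in centered) / n
            for j in range(d)] for i in range(d)]
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    return v

def reduce_to_1d(data, v):
    """Project each feature vector onto the component, keeping only the
    coordinate along which the samples vary most."""
    return [sum(x * c for x, c in zip(row, v)) for row in data]
```

Keeping the top few such components gives a much shorter feature vector that still separates the classes, which is exactly the trade-off described above.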

 

2. Classifier design and training

For a text image, we extract its features and feed them to a classifier; the classifier classifies them and tells us which character the features should be recognized as.

Designing the classifier is our task. Common classifier design methods include template matching, discriminant functions, neural network classification, and rule-based reasoning, which are not detailed here. Before actual recognition, the classifier usually has to be trained, which is a supervised learning process. There are also many mature classifiers available, such as SVMs and CNNs.
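The simplest of these, template matching, can be sketched directly: compare the input glyph with a stored binary template for each known character and return the label with the best pixel overlap. This is a toy illustration, not a trained SVM or CNN:

```python
def match_score(a, b):
    """Fraction of identical pixels between two equal-size binary glyphs."""
    flat_a = [p for row in a for p in row]
    flat_b = [p for row in b for p in row]
    return sum(x == y for x, y in zip(flat_a, flat_b)) / len(flat_a)

def classify(glyph, templates):
    """Template-matching classifier: return the label whose stored
    template overlaps the input glyph best."""
    return max(templates, key=lambda label: match_score(glyph, templates[label]))

# Toy 3x3 templates for a vertical bar "I" and a horizontal bar "-":
templates = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "-": [[0, 0, 0], [1, 1, 1], [0, 0, 0]],
}
noisy_I = [[0, 1, 0], [0, 1, 0], [1, 1, 0]]  # an "I" with one noise pixel
print(classify(noisy_I, templates))  # → I
```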

 

3. Post-processing

In essence, post-processing optimizes the classifier's classification results, and generally falls within the domain of natural language understanding.

The first is the handling of similar characters: for example, "分" and "兮" look alike, but when the word "分数" (score) is encountered, it should not be recognized as "兮数", because "分数" is a real word. Such errors need to be corrected by a language model.
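A toy sketch of this kind of correction, assuming a lexicon of valid words and a table of visually confusable characters (both made up here for illustration):

```python
def correct(token, lexicon, confusions):
    """If the recognised token is not a known word, try swapping each
    commonly-confused character once and return the first variant that
    is in the lexicon; otherwise keep the token unchanged."""
    if token in lexicon:
        return token
    for i, ch in enumerate(token):
        for alt in confusions.get(ch, ()):
            variant = token[:i] + alt + token[i + 1:]
            if variant in lexicon:
                return variant
    return token

# e.g. the digit "5" misread in place of the letter "s":
print(correct("5core", {"score"}, {"5": ["s"]}))  # → score
```

A real language model would rank candidates by context probability instead of a flat dictionary lookup, but the structure of the fix is the same.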

The second is the handling of text layout: for example, some books are typeset in two columns, where the left and right columns on the same line do not belong to the same sentence and have no grammatical connection. If the text is segmented purely by line, the end of a left-column line gets joined to the beginning of the right-column line, which is not what we want. This situation requires special treatment.

 

III. Application Scenarios

 

1. Digitally-born images

Taobao product images are the most representative digitally-born text images.
Features:

  • The most complex and diverse: all kinds of fonts, backgrounds, arrangements, and combinations (the MTWI challenge, the largest OCR competition).
  • The most valuable: they are the carriers of commodity information.
  • The largest image volume: hundreds of billions of images, updated daily.

 

2. Document

Document OCR is in very wide demand, covering all kinds of official and business scenarios.
Features:

  • Toward a 100% recognition rate: human input is about 98% accurate, so this explores the limits of AI;
  • Product ease of use: complete functionality, close to business needs;
  • Commercial application: the document OCR business is mature.

 

3. Photographed forms

OCR of photographed forms is very valuable and very challenging.
Features:

  • Scenarios & data: the data is private, and typical application scenarios accumulate technical capability;
  • Product versatility: expert knowledge + templates = text understanding, with one set of solutions covering hundreds of form types.
  • Business value: deep integration with industry scenarios; AI capabilities improve industry data workflows (customized photographed-form recognition and structured cloud services are provided).


4. Natural scenes

A key direction of academic OCR research.
Features:

  • Data: no specific data type definition, for example street-view photos;
  • Technical difficulty: uncertainty; the essential difficulty of complex environmental interference lies in localization and recognition;
  • Commercial value: huge market potential, for example license plate recognition, camera surveillance, and autonomous driving (technical capabilities are ahead; industry adoption is in progress).

 

[Reprinted for study only; no other use]

Origin blog.csdn.net/zhongguomao/article/details/108303104