Tesseract OCR: Principles, Introduction, and Usage

1. Tesseract Introduction

Tesseract is an open source OCR engine that was developed at Hewlett-Packard's Bristol laboratory between 1985 and 1995. It was among the best engines in the 1995 UNLV accuracy test, but development essentially stopped after 1996. HP released it as open source in 2005, and from 2006 Google improved it, fixing bugs and optimizing it. The project currently lives at: https://github.com/tesseract-ocr/tesseract.

Combined with the Leptonica image processing library, it can read images in a variety of formats and convert them to text in more than 60 languages. We can also keep training its language data, so its ability to convert images to text keeps growing.

Tesseract 4 added an LSTM-based engine, which mainly recognizes whole lines of text; text lines are covered below.

In general, Tesseract works by recognizing characters, specifically with a polygonal approximation method, and recognition proceeds step by step.

The following content is essentially a translated summary of Ray Smith's "An Overview of the Tesseract OCR Engine".

1.1 Tesseract structure

1. Connected component analysis detects the character regions (outlines) and their nested sub-outlines. At this stage the outlines are gathered into blobs.

2. Text lines are derived from the blobs and character outlines. Each line is analyzed as one of two kinds of text: fixed-pitch or proportional. Fixed-pitch text is chopped directly into single character cells; proportional text is broken into words using definite spaces and fuzzy spaces.

(1) Fixed-pitch text: the character spacing is constant.
(2) Non-fixed-pitch, or proportional, text: words are separated by spaces, but the spaces are not all the same width, so fuzzy spaces are introduced and resolved later, after character recognition.
3. Each word is analyzed in turn. Words that are recognized well enough are also passed to the adaptive classifier as training samples, so that characters further down the page (such as the footer) are recognized more accurately. By then the characters near the top of the page have already been recognized less accurately, so Tesseract runs a second pass over the words that were not recognized well, which improves accuracy. The page is therefore processed twice.

4. Finally, fuzzy spaces are resolved, and alternative x-height hypotheses are checked to locate small-cap text and handle case.
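
The first two steps above can be illustrated with a small sketch. This is not Tesseract's implementation: it labels 4-connected foreground pixels in a tiny character grid, treats each component's bounding box as a stand-in for a blob, and then checks whether the blobs sit on a constant pitch. The function names, the "#" pixel convention, and the tolerance parameter are assumptions made for the illustration.

```python
from collections import deque

def find_blobs(grid):
    """Return bounding boxes (left, top, right, bottom) of 4-connected components."""
    height, width = len(grid), len(grid[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if grid[y][x] != "#" or seen[y][x]:
                continue
            # Breadth-first flood fill from this unvisited foreground pixel.
            queue = deque([(x, y)])
            seen[y][x] = True
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and grid[ny][nx] == "#" and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)

def is_fixed_pitch(boxes, tolerance=0):
    """Hypothetical test: are the left edges of consecutive blobs evenly spaced?"""
    pitches = [b[0] - a[0] for a, b in zip(boxes, boxes[1:])]
    return len(pitches) > 0 and max(pitches) - min(pitches) <= tolerance

fixed = ["##..##..##",
         "##..##..##"]
proportional = ["#..###.#",
                "#..###.#"]
fixed_boxes = find_blobs(fixed)
proportional_boxes = find_blobs(proportional)
```

With these toy grids, `is_fixed_pitch(fixed_boxes)` holds and `is_fixed_pitch(proportional_boxes)` does not, which is the distinction that decides how a line is later cut into words.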

1.2 Finding text lines and words

Page layout analysis has already been carried out, so the text regions are roughly known; the blobs in those regions are then processed to form text lines. The baselines are fitted with quadratic splines, although cubic splines might work better.
The fitted lines are the baseline, the descender line, the mean line, and the ascender line; they are parallel to one another and slightly curved. In the original figure the ascender line is the blue one.
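
As a rough illustration of fitting a slightly curved baseline, the sketch below fits a single quadratic y = a*x^2 + b*x + c to a few sample points by least squares and derives a parallel line by a constant vertical offset. Tesseract fits quadratic splines over several segments; the single polynomial, the function names, and the sample points here are simplifying assumptions.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
            - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
            + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0]))

def fit_quadratic(points):
    """Least-squares fit of y = a*x^2 + b*x + c; returns (a, b, c)."""
    sx = [sum(x ** k for x, _ in points) for k in range(5)]        # sums of x^0 .. x^4
    sy = [sum((x ** k) * y for x, y in points) for k in range(3)]  # sums of x^k * y
    # Normal equations of the least-squares problem.
    matrix = [[sx[4], sx[3], sx[2]],
              [sx[3], sx[2], sx[1]],
              [sx[2], sx[1], sx[0]]]
    rhs = [sy[2], sy[1], sy[0]]
    d = det3(matrix)
    if d == 0:
        raise ValueError("need at least three points with distinct x values")
    coeffs = []
    for col in range(3):
        # Cramer's rule: replace one column with the right-hand side.
        replaced = [row[:] for row in matrix]
        for r in range(3):
            replaced[r][col] = rhs[r]
        coeffs.append(det3(replaced) / d)
    return tuple(coeffs)

def offset_line(coeffs, dy):
    """A line parallel to the fitted one, shifted vertically by dy."""
    a, b, c = coeffs
    return (a, b, c + dy)

def evaluate(coeffs, x):
    a, b, c = coeffs
    return a * x * x + b * x + c

# Hypothetical bottom points of blobs lying on a gently curved line.
bottoms = [(0, 3.0), (1, 2.5), (2, 3.0), (3, 4.5), (4, 7.0)]
baseline = fit_quadratic(bottoms)
mean_line = offset_line(baseline, 10.0)  # assumed x-height of 10 units
```

Because the other lines are parallel to the baseline, they share its curvature and differ only in the constant term.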

1.3 Character recognition

This part describes recognition of non-fixed-pitch text.

1.3.1 Chopping joined characters

When the result for a complete word is unsatisfactory, Tesseract tries to improve it by chopping blobs at the character level. Candidate chop points are concave vertices of the polygonal approximation of the outline; each may have another concave vertex or a line segment opposite it. It can take up to 3 pairs of chop points to separate joined characters successfully.
In the original figure, arrows mark a series of candidate chop points, and the chop between the letters r and m is drawn as a line segment.
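
To make the idea of concave vertices concrete, here is a minimal sketch. It assumes the outline is a simple polygon with vertices listed counter-clockwise, finds the vertices where the outline turns clockwise, and proposes the closest pair of them as a chop. The closest-pair rule and all names are assumptions for illustration; Tesseract's own prioritization of chop points is more involved.

```python
from itertools import combinations

def concave_vertices(polygon):
    """Vertices of a counter-clockwise polygon where the outline turns clockwise."""
    result = []
    n = len(polygon)
    for i in range(n):
        px, py = polygon[i - 1]
        cx, cy = polygon[i]
        nx, ny = polygon[(i + 1) % n]
        # z-component of the cross product of the incoming and outgoing edges.
        cross = (cx - px) * (ny - cy) - (cy - py) * (nx - cx)
        if cross < 0:
            result.append((cx, cy))
    return result

def closest_chop(polygon):
    """Hypothetical heuristic: chop between the two nearest concave vertices."""
    pairs = combinations(concave_vertices(polygon), 2)
    return min(pairs, default=None,
               key=lambda p: (p[0][0] - p[1][0]) ** 2 + (p[0][1] - p[1][1]) ** 2)

# Two blocks joined by a narrow neck, like two touching characters.
joined = [(0, 0), (3, 0), (4, 1), (5, 0), (8, 0),
          (8, 4), (5, 4), (4, 3), (3, 4), (0, 4)]
candidates = concave_vertices(joined)
chop = closest_chop(joined)
```

For this outline the two notches at the neck are the only concave vertices, so the proposed chop runs straight across the neck.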

1.3.2 Associating broken characters

When the potential chop points have been exhausted and the word is still not recognized well enough, it is handed to the associator. The associator runs a best-first search over the possible combinations of the maximally chopped blobs into candidate characters.
The fully-chop-then-associate approach may be inefficient and may miss important chops. However, chop-then-associate simplifies the data structures that would otherwise be needed to maintain the full segmentation graph.
Words with broken characters, such as the one in the original figure, are easily recognized with this method.
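
The search over groupings can be sketched as follows. This is a plain best-first (uniform-cost) search over ways to group adjacent fragments, not Tesseract's associator; the `cost_of` callback stands in for the classifier and the integer cost table is invented for the example.

```python
import heapq

def associate(pieces, cost_of):
    """Find the cheapest grouping of adjacent pieces into candidate characters.

    pieces:  maximally chopped fragments in left-to-right order.
    cost_of: hypothetical classifier callback; returns a cost for a tuple of
             adjacent fragments read as one character (lower is better), or
             None if that group cannot be a character.
    """
    n = len(pieces)
    # Heap entries: (total cost so far, index of next piece, groups chosen so far).
    heap = [(0, 0, ())]
    expanded = set()
    while heap:
        cost, start, groups = heapq.heappop(heap)
        if start == n:
            return cost, list(groups)
        if start in expanded:
            continue  # this position was already reached more cheaply
        expanded.add(start)
        for end in range(start + 1, n + 1):
            group = tuple(pieces[start:end])
            step = cost_of(group)
            if step is not None:
                heapq.heappush(heap, (cost + step, end, groups + (group,)))
    return None

# Toy costs: the fragments "c" and "l" together read well as one character.
toy_costs = {("c",): 6, ("l",): 5, ("c", "l"): 2, ("o",): 1, ("g",): 1}
result = associate(["c", "l", "o", "g"], toy_costs.get)
```

Here the search prefers joining the first two fragments, because the combined cost is lower than classifying them separately.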

1.4 Static character classifier

1.4.1 Features

Tesseract does not use topological shape features; it uses the polygonal approximation instead. Using the polygon segments directly as features, however, is not robust to damaged characters.
The solution that works is that the features of the unknown character do not have to be the same as the features used in training.
In the original figure, the short, thick lines are the features of the unknown character, which are small and of fixed length; the long, thin lines are the clustered segments of the polygonal approximation, which serve as prototypes. Matching small features against large prototypes copes easily with damaged images. The drawback is that it requires a lot of computation.

1.4.2 Classification

Classification is a two-step process. The first step builds a shortlist of candidate character classes that the unknown character might match. The second step computes the similarity between the unknown character and the prototypes of each candidate class and selects the best match.
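
The two steps can be sketched with a toy example. Everything here is an assumption for illustration: the feature vectors, the coarse binning used for the shortlist, and the squared-distance measure are not Tesseract's actual class pruner or matcher.

```python
def coarse_bins(features, bins=2):
    """Quantize each feature value in [0, 1] into a small number of bins."""
    return [min(int(v * bins), bins - 1) for v in features]

def shortlist(unknown, prototypes, keep=2):
    """Step 1: cheap ranking by how many coarse bins agree; keep the top few."""
    target = coarse_bins(unknown)

    def matches(name):
        return sum(a == b for a, b in zip(target, coarse_bins(prototypes[name])))

    ranked = sorted(prototypes, key=lambda name: (-matches(name), name))
    return ranked[:keep]

def classify(unknown, prototypes, keep=2):
    """Step 2: detailed distance, computed only for the shortlisted classes."""
    def distance(name):
        return sum((a - b) ** 2 for a, b in zip(unknown, prototypes[name]))

    return min(shortlist(unknown, prototypes, keep), key=distance)

# Hypothetical prototype feature vectors for three classes.
prototypes = {
    "o": [0.5, 0.5, 0.5, 0.5],
    "c": [0.5, 0.5, 0.1, 0.5],
    "l": [0.1, 0.9, 0.9, 0.1],
}
unknown = [0.5, 0.5, 0.2, 0.5]
candidates = shortlist(unknown, prototypes)
best = classify(unknown, prototypes)
```

The point of the split is cost: the expensive distance is evaluated only for the few classes that survive the cheap first step.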

1.5 Linguistic analysis

Tesseract does relatively little linguistic analysis, for example preferring frequently used words.
Each character classification produces two numbers. The first is the negated normalized distance from the prototype, so a larger value is better. The second, the rating, is the normalized distance from the prototype multiplied by the total outline length of the unknown character.
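
The two numbers follow directly from the description above. In this small sketch the function names and the sample values are assumptions; summing the character ratings into a word score is one simple way to compare alternative segmentations.

```python
def confidence(normalized_distance):
    """Negated normalized distance from the prototype: larger is better."""
    return -normalized_distance

def rating(normalized_distance, outline_length):
    """Normalized distance scaled by the outline length of the unknown character."""
    return normalized_distance * outline_length

def word_rating(characters):
    """Sum of the character ratings in a word; lower means a closer match."""
    return sum(rating(distance, length) for distance, length in characters)

# Hypothetical (normalized distance, outline length) pairs for a two-character word.
word = [(0.25, 40), (0.5, 20)]
total = word_rating(word)
```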

1.6 Adaptive classifier

Because the static classifier has to generalize well to any font, its ability to discriminate between different characters, or between characters and non-characters, is weakened.
The difference between the static and the adaptive classifier is normalization: the adaptive classifier uses isotropic baseline/x-height normalization, whereas the static classifier normalizes by the centroid.
Baseline/x-height normalization makes it easy to tell uppercase and lowercase characters apart.
The original figure shows the two methods.
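
A small sketch shows why the two normalizations behave differently. It assumes outline points with y increasing upward; the per-axis scaling by spread in the centroid version follows the paper's description of anisotropic size normalization, and all names and sample shapes are made up for the example.

```python
from math import sqrt

def baseline_normalize(points, baseline_y, x_height):
    """Isotropic: shift to the baseline and left edge, scale both axes by the x-height."""
    left = min(x for x, _ in points)
    return [((x - left) / x_height, (y - baseline_y) / x_height) for x, y in points]

def centroid_normalize(points):
    """Anisotropic: centre on the centroid, scale each axis by its own spread."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    spread_x = sqrt(sum((x - mean_x) ** 2 for x, _ in points) / n) or 1.0
    spread_y = sqrt(sum((y - mean_y) ** 2 for _, y in points) / n) or 1.0
    return [((x - mean_x) / spread_x, (y - mean_y) / spread_y) for x, y in points]

lower = [(0, 0), (10, 0), (10, 10), (0, 10)]  # box as tall as the x-height
upper = [(0, 0), (10, 0), (10, 20), (0, 20)]  # box twice as tall

lower_top = max(y for _, y in baseline_normalize(lower, 0, 10))
upper_top = max(y for _, y in baseline_normalize(upper, 0, 10))
same_after_centroid = centroid_normalize(lower) == centroid_normalize(upper)
```

After baseline/x-height normalization the taller box still reaches twice as high, so case information survives; after centroid normalization the two boxes become identical.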


1.7 Summary

Tesseract's main strength is its unusual choice of features. Its main weakness is that it uses a polygonal approximation, rather than the raw outline, as input to the classifier.

2. Download and install

Download the installer

On Windows, download the exe installer and run it.
Download: https://tesseract-ocr.github.io/tessdoc/Home.html

Download language data

For example, the Simplified Chinese data file is at:
https://raw.githubusercontent.com/tesseract-ocr/tessdata/master/chi_sim.traineddata

3. Set Environment Variables

Path environment variable

Right-click "My Computer" and select "Properties - Advanced System Settings - Environment Variables", then add the Tesseract installation directory to the Path variable.


TESSDATA_PREFIX environment variable

Error:
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.


Solution: add one more environment variable, TESSDATA_PREFIX, and set it to the tessdata directory.
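
Before running Tesseract, the setting can be checked with a few lines of Python. This is a minimal sketch that assumes TESSDATA_PREFIX points directly at the directory holding the .traineddata files, as the error message asks; the helper name and the temporary directory used for the demonstration are made up for the example.

```python
import os
import tempfile
from pathlib import Path

def missing_languages(languages, environ=None):
    """Return the languages whose .traineddata file cannot be found."""
    environ = os.environ if environ is None else environ
    prefix = environ.get("TESSDATA_PREFIX")
    if not prefix:
        # Variable not set: nothing can be located through it.
        return list(languages)
    data_dir = Path(prefix)
    return [lang for lang in languages
            if not (data_dir / f"{lang}.traineddata").is_file()]

# Demonstration with a throwaway directory standing in for tessdata.
with tempfile.TemporaryDirectory() as tessdata:
    (Path(tessdata) / "eng.traineddata").touch()
    demo_env = {"TESSDATA_PREFIX": tessdata}
    missing = missing_languages(["chi_sim", "eng"], demo_env)

unset = missing_languages(["eng"], {})
```

Calling `missing_languages(["chi_sim", "eng"])` without the second argument checks the real environment.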

4. Run the test

Format: tesseract imagename outputbase [-l lang] [--oem ocrenginemode] [--psm pagesegmode] [configfiles ...]

Example: tesseract t1.jpg result.txt -l chi_sim+eng
Parameters:
chi_sim: the Simplified Chinese language pack.
eng: the English language pack.
--psm NUM: Specify page segmentation mode.
--oem NUM: Specify OCR Engine mode.
NOTE: These options must occur before any configfile.

Page segmentation modes:
0 Orientation and script detection (OSD) only.
1 Automatic page segmentation with OSD.
2 Automatic page segmentation, but no OSD, or OCR.
3 Fully automatic page segmentation, but no OSD. (Default)
4 Assume a single column of text of variable sizes.
5 Assume a single uniform block of vertically aligned text.
6 Assume a single uniform block of text.
7 Treat the image as a single text line.
8 Treat the image as a single word.
9 Treat the image as a single word in a circle.
10 Treat the image as a single character.
11 Sparse text. Find as much text as possible in no particular order.
12 Sparse text with OSD.
13 Raw line. Treat the image as a single text line,
bypassing hacks that are Tesseract-specific.

OCR Engine modes:
0 Legacy engine only.
1 Neural nets LSTM engine only.
2 Legacy + LSTM engines.
3 Default, based on what is available.
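
The same command can be assembled from a script. The sketch below only builds the argument list and validates the mode numbers against the two tables above; the helper names are assumptions, and it uses the double-dash spelling of the options from Tesseract 4 and later. It runs the command only if a tesseract executable is found on the Path.

```python
import shutil
import subprocess

def build_command(image, output_base, languages=("eng",), psm=3, oem=3):
    """Build the tesseract argument list; options come before any configfile."""
    if not 0 <= psm <= 13:
        raise ValueError("page segmentation mode must be between 0 and 13")
    if not 0 <= oem <= 3:
        raise ValueError("OCR engine mode must be between 0 and 3")
    return ["tesseract", image, output_base,
            "-l", "+".join(languages),
            "--psm", str(psm),
            "--oem", str(oem)]

def run_ocr(command):
    """Run the command if tesseract is installed; return its exit code or None."""
    if shutil.which(command[0]) is None:
        return None
    return subprocess.run(command, check=False).returncode

command = build_command("t1.jpg", "result", ("chi_sim", "eng"), psm=6)
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.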


Summary: the results are acceptable and recognition is fast, but the accuracy does not feel high enough, especially for images with complex backgrounds.


Origin blog.csdn.net/zephyr_wang/article/details/104928001