Research Tesseract algorithm

https://github.com/tesseract-ocr/tesseract/wiki — project information on GitHub.

https://blog.csdn.net/guzhenping/article/details/51023687 — a blog post about Tesseract.

My personal notes on the paper "An Overview of the Tesseract OCR Engine".

Outline

Character outlines are formed from connected components (blobs; I take this to mean connected domains). Blobs are gathered into text lines, and the lines and regions are analyzed for fixed pitch or proportional spacing. Each line of text is broken into words according to the character spacing: fixed-pitch text is chopped into characters immediately at the pitch, while proportional text is broken into words using definite spaces and fuzzy spaces. Recognition then proceeds in two passes. In the first pass, each word is recognized in turn; every word recognized satisfactorily is passed to an adaptive classifier as training data, which gives the adaptive classifier a chance to recognize text further down the page more accurately. Because the adaptive classifier may learn something useful too late to help near the top of the page, a second pass is run in which previously difficult words are recognized again. Finally, fuzzy spaces are resolved, and alternative hypotheses for the x-height are checked to locate small-capital text.
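To make the two-pass idea above concrete, here is a small sketch of my own (not Tesseract's code): the confidence threshold, the `shape_of` key, and the stand-in static classifier are all assumptions for illustration.

```python
# Sketch of the two-pass flow: pass 1 recognizes words top-to-bottom and
# feeds confident results to an adaptive store; pass 2 retries poorly
# recognized words using what was learned over the whole page.

def recognize(words, static_classify, shape_of):
    """words: list of word images (any stand-in objects).
    static_classify: word -> (text, confidence).
    shape_of: word -> a key under which the adaptive classifier learns."""
    adaptive = {}                 # shape -> text learned on this page
    results = {}
    for i, w in enumerate(words):               # pass 1
        text, conf = static_classify(w)
        results[i] = (text, conf)
        if conf > 0.9:                          # "satisfactory" (assumed)
            adaptive[shape_of(w)] = text
    for i, w in enumerate(words):               # pass 2
        text, conf = results[i]
        if conf <= 0.9 and shape_of(w) in adaptive:
            results[i] = (adaptive[shape_of(w)], conf)
    return [results[i][0] for i in range(len(words))]
```

The point of the second loop is that a word mangled near the top of the page can still benefit from training data gathered lower down.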

Line finding

The line finding algorithm had already been published: https://www.hpl.hp.com/techreports/94/HPL-94-113.pdf . The common approach to skew detection is to apply a Hough transform to a simplified sub-region of the image, or to find the direction along which pixels line up with the baselines of the text: the image is projected at different angles and the angle whose projection has the highest peak is taken as the skew direction.

This algorithm is as follows:

Connected component analysis: the connected components (connected domains? my guess at the terminology) are called blobs, and each blob's size and position are represented by its bounding-box coordinates.

Blob filtering: select a subset of blobs likely to represent body text. Precision is not very important here; the main goal is to filter out drop caps, underlines, and isolated noise. Using the most common height and width as a reference, blobs much smaller than that height are removed, keeping only blobs whose height and width fall within a certain range.

Sort the blobs by the x-coordinate of the left edge of their bounding boxes. This ordering ensures that even on a tilted page a blob is unlikely to be put on the wrong line. Each blob is assigned to the existing line it overlaps most; if there is none, it starts a new line. As subsequent blobs are added, the line's bounds are extended within limits, and its vertical offset is updated from the average y of the blobs' lower bounds.
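The assignment step above can be sketched like this (my own simplification, not Tesseract's exact logic): blobs are processed in left-edge order, each goes to the line band it overlaps most vertically, and the band is updated as blobs arrive. The overlap rule and the running mean of the bottoms are assumptions.

```python
# Assign bounding boxes to text lines after sorting by left edge x.

def assign_to_lines(blobs):
    """blobs: list of (left, bottom, right, top) boxes."""
    lines = []      # each line: {"blobs": [...], "bottom": ..., "top": ...}
    for b in sorted(blobs, key=lambda b: b[0]):
        left, bottom, right, top = b
        best, best_overlap = None, 0.0
        for line in lines:
            # vertical overlap between the blob and the line's current band
            lo = max(bottom, line["bottom"])
            hi = min(top, line["top"])
            if hi - lo > best_overlap:
                best, best_overlap = line, hi - lo
        if best is None:
            lines.append({"blobs": [b], "bottom": bottom, "top": top})
        else:
            best["blobs"].append(b)
            n = len(best["blobs"])
            # update the band: running mean of bottoms, max of tops (assumed)
            best["bottom"] += (bottom - best["bottom"]) / n
            best["top"] = max(best["top"], top)
    return lines
```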

(My understanding: after sorting, the connected components are assigned to lines under bound constraints; since the component boundaries vary only gradually with page skew and are relatively stable, most components end up on the correct line, and a straight line can then be fitted to the bounding-box coordinates of the assigned components.)

A baseline is then fitted to the blobs already assigned to each line, using a least median of squares fit (a robust variant of ordinary least squares).
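Least median of squares minimizes the median squared residual instead of the sum, so up to roughly half the points can be outliers without pulling the line off. A minimal sketch: candidate lines through every pair of points are scored by their median squared residual (practical implementations sample pairs randomly for speed).

```python
# Exhaustive least-median-of-squares line fit over (x, y) points.

def lms_fit(points):
    """Return (slope, intercept) minimizing the median squared residual."""
    best = None
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            (x1, y1), (x2, y2) = points[i], points[j]
            if x1 == x2:
                continue
            m = (y2 - y1) / (x2 - x1)
            c = y1 - m * x1
            res = sorted((y - (m * x + c)) ** 2 for x, y in points)
            med = res[len(res) // 2]        # median squared residual
            if best is None or med < best[0]:
                best = (med, m, c)
    return best[1], best[2]
```

With four collinear blob bottoms and one gross outlier (say an underline fragment), an ordinary least-squares fit would tilt toward the outlier, but the median-based score ignores it.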

The key points are filtering out interfering blobs and constructing text lines from the remaining blobs.

Line finding lets a skewed page be recognized without de-skewing it, which preserves image quality. Assuming that layout analysis has already provided text regions of roughly uniform text size, a simple height filter removes interference such as blobs smaller than some fraction of the line height. The filtered blobs fit a model of non-overlapping, parallel, possibly sloping lines much better. Sorting the blobs by x-coordinate and processing them in that order makes it possible to assign them to unique text lines while tracking the slope across the page, greatly reducing the risk of assigning blobs to the wrong line when skew is present. Once the filtered blobs have been assigned to lines, the baselines are estimated with a least median of squares fit, and the blobs that were filtered out are fitted back into the appropriate lines. Finally, blobs that overlap horizontally by at least half are merged, putting diacritics together with the correct base characters and joining the parts of broken characters.

Fitting baseline

Once the text lines have been found, the baselines are fitted more precisely using a quadratic spline. This was another first for an OCR system, and it allows Tesseract to handle pages with curved baselines. The baseline is fitted by partitioning the blobs into groups whose displacement from the original straight baseline is reasonably continuous, and a quadratic spline is then fitted to the most populous partition by least squares.
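The partition step above can be sketched as follows (my own reading): each blob's displacement from the straight baseline is bucketed, and the largest bucket is the group the spline would be fitted to. The bucket size is an assumption.

```python
# Group blob bottoms by displacement from the straight baseline y = m*x + c
# and return the most populous group.

def densest_partition(blobs, slope, intercept, bucket=4.0):
    """blobs: list of (x, bottom_y) points."""
    groups = {}
    for x, y in blobs:
        disp = y - (slope * x + intercept)
        key = int(disp // bucket)           # quantize the displacement
        groups.setdefault(key, []).append((x, y))
    return max(groups.values(), key=len)
```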

 

An example text line showing the baseline, descender line, mean line, and ascender line: the lines are parallel (the y separation is constant across the whole line) and may be slightly curved.

Fixed pitch and proportional spacing

Each line of text is checked for fixed pitch; where fixed pitch is found, the line can be chopped into characters immediately at the pitch. For non-fixed-pitch text, the gaps are measured in a limited vertical range between the baseline and the mean line; gaps close to the threshold are made fuzzy at this stage, so that a final decision can be made after word recognition.
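As a toy illustration of the fixed-pitch case (my own simplification; Tesseract's actual pitch test is more involved): decide fixed pitch when the gaps between character centers are nearly uniform, then chop the line at multiples of the pitch.

```python
# Detect fixed pitch from character centers, then chop at the pitch.

def is_fixed_pitch(centers, tol=0.1):
    """centers: sorted x-centers of blobs on a line.
    Returns (fixed?, mean gap); fixed if every gap is within tol of the mean."""
    gaps = [b - a for a, b in zip(centers, centers[1:])]
    mean = sum(gaps) / len(gaps)
    return all(abs(g - mean) <= tol * mean for g in gaps), mean

def chop_points(left, right, pitch):
    """x-coordinates at which a fixed-pitch line is chopped into cells."""
    cuts, x = [], left + pitch
    while x < right:
        cuts.append(round(x, 6))
        x += pitch
    return cuts
```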

Recognition

 

Joined characters (Fig. 4) are chopped using candidate chop points taken from the vertices of the polygonal approximation of the outline, with the recognition confidence used to decide among them. When the result is still poor after trying the candidate boundaries of a badly segmented character, the pieces are spliced back together: an associator performs a best-first search of a priority queue, evaluating unrecognized combinations of the chopped fragments. Doing the chopping first and the association afterwards simplifies the data structures needed to maintain the full segmentation graph.
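The association search above can be sketched with a priority queue: fragments of a maximally chopped word are merged into candidate characters, and states are expanded cheapest-first. `classify` here is a hypothetical stand-in returning a (text, cost) for a span of fragments, not a real Tesseract API.

```python
import heapq

# Best-first search over segmentations of chopped fragments.

def associate(n_fragments, classify):
    """Find the lowest-cost reading of fragments 0..n-1.
    classify(i, j) -> (text, cost) for fragments i..j-1 taken as one char."""
    # state: (total cost so far, next fragment index, text so far)
    heap = [(0.0, 0, "")]
    while heap:
        cost, i, text = heapq.heappop(heap)
        if i == n_fragments:            # all fragments consumed: done
            return text, cost
        for j in range(i + 1, n_fragments + 1):
            ch, c = classify(i, j)
            heapq.heappush(heap, (cost + c, j, text + ch))
    return "", float("inf")
```

With classifier costs favoring "m" over "r"+"n" for the first two fragments of a chopped "mi", the search returns "mi" rather than "rni": the confidence of the combined pieces decides the segmentation, which matches my reading of the paragraph above.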


The features of an unknown character need not exactly match the training data. During training, the segments of a polygonal approximation are used as features, but during recognition small fixed-length features (in normalized units) are extracted from the outline and matched many-to-one against the clustered prototype features of the training data. The short, thick features extracted from the unknown are matched against the long, thin prototype features aggregated from polygonal-approximation segments (Fig. 6). The features of a damaged character fail to match the complete character as a whole, but the small pieces each match well, which shows that matching on small features can cope with broken characters. The only problem is that computing the distance between the unknown's features and a prototype is computationally expensive. A feature extracted from the unknown is a 3-dimensional vector (x, y, angle); a prototype, expressed by the polygonal approximation, is a 4-dimensional vector (x, y, angle, length).
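One plausible way to compare a short 3-vector feature against a long 4-vector prototype segment (my guess, not the paper's exact formula) is to combine the feature's perpendicular distance to the segment with the difference in angle:

```python
import math

# Distance between an unknown feature (x, y, angle) and a prototype
# segment (x, y, angle, length), where (x, y) is the segment midpoint.
# The equal weighting of position and angle is an assumption.

def feature_distance(f, p, angle_weight=1.0):
    fx, fy, fa = f
    px, py, pa, plen = p
    # project the feature position onto the prototype's direction
    dx, dy = fx - px, fy - py
    ux, uy = math.cos(pa), math.sin(pa)
    along = dx * ux + dy * uy                        # position along segment
    along = max(-plen / 2, min(plen / 2, along))     # clamp to the segment
    cx, cy = px + along * ux, py + along * uy        # closest segment point
    perp = math.hypot(fx - cx, fy - cy)
    da = abs((fa - pa + math.pi) % (2 * math.pi) - math.pi)  # angle diff
    return perp + angle_weight * da
```

Under this reading, many short features lying anywhere along a long prototype segment all score near zero, which is exactly what lets fragments of a broken character still match.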

Classification

Classification is a two-step process. First, a class pruner creates a shortlist of classes the unknown's features might match. Each feature looks up, in a coarsely quantized three-dimensional table, a bit vector of the classes it could belong to; the bit vectors are summed over all features, and the classes with the highest counts (i.e. the best matches) become the shortlist for the next step. Each feature of the unknown then looks up a bit vector of the prototypes of each shortlisted class that it might match, and the actual similarity between them is computed. Each prototype character class is represented as a logical sum-of-products expression, and the similarity computation keeps a record of the total similarity of each feature and of each prototype. The best combined distance is computed from the summed feature and prototype evidence. (I do not fully understand how the distance between the short features and the long prototype segments is accumulated statistically; perhaps whole-character and local feature evidence are combined?)
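The class-pruner step above reduces to counting votes: each quantized feature contributes one vote to every class its table entry allows, and the top-scoring classes survive. A toy sketch (the table contents here are made-up data, and sets stand in for the bit vectors):

```python
# Class pruner: sum per-feature candidate-class votes, keep the best few.

def class_pruner(features, table, n_best=2):
    """features: list of quantized feature keys.
    table: feature key -> set of class names that feature could match."""
    counts = {}
    for f in features:
        for cls in table.get(f, ()):
            counts[cls] = counts.get(cls, 0) + 1
    ranked = sorted(counts, key=lambda c: -counts[c])
    return ranked[:n_best]
```

In the real engine the per-feature lookup is a bit vector indexed by the coarsely quantized (x, y, angle), so the summation is very cheap; only the shortlist then pays for the expensive prototype distance computation.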

Language Analysis

The best available result is chosen from each of the following categories: most frequent word, dictionary word, numeric word, upper-case word, lower-case word, and classifier-choice word. The final decision simply selects the word with the lowest total distance rating, with each category's rating multiplied by a different constant. Different character segmentations can produce words with different numbers of characters, so even raw probabilities are difficult to compare directly across them. Each character classification therefore produces two numbers. The first is the confidence, which is minus the normalized distance from the prototype. The second is the rating, which multiplies the normalized distance from the prototype by the total outline length of the character; ratings within a word can be summed meaningfully because the total outline length of a given character is always roughly the same. Baseline/x-height normalization helps distinguish upper-case from lower-case characters and improves immunity to noise; the main benefit of normalizing by the character's bounding box (moments) is that it removes the effect of aspect ratio and of the different stroke widths of different fonts.

Adaptive classifier

Because the static classifier must generalize across all fonts, its ability to discriminate between similar characters is weakened. A more discriminative adaptive classifier is therefore trained, for each document, on the characters already identified by the static classifier. The adaptive classifier differs from the static classifier mainly in the normalization used: the adaptive classifier uses an isotropic baseline/x-height normalization, while the static classifier normalizes characters by the centroid (first moments) for position and by the second moments for size. Baseline/x-height normalization makes it easier to distinguish upper- and lower-case letters and improves noise immunity; the main advantage of moment normalization is that it removes the aspect-ratio and stroke-width differences between fonts.
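The moment normalization attributed above to the static classifier can be sketched as follows: translate by the centroid (first moments) and scale each axis by the second moments, so position, size, and aspect ratio are all factored out. This is a generic moment normalization, not Tesseract's exact code.

```python
# Normalize outline sample points by first and second moments.

def moment_normalize(points):
    """points: list of (x, y) outline samples -> normalized points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n              # first moments (centroid)
    my = sum(y for _, y in points) / n
    sx = (sum((x - mx) ** 2 for x, _ in points) / n) ** 0.5 or 1.0
    sy = (sum((y - my) ** 2 for _, y in points) / n) ** 0.5 or 1.0
    return [((x - mx) / sx, (y - my) / sy) for x, y in points]
```

After this transform a tall narrow glyph and a short wide one map onto the same coordinate frame, which is exactly the aspect-ratio insensitivity the paragraph describes.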

 

After reading this overview, the main inspiration for me is the idea behind skew detection: using connected components to fit the baselines. The chopping of joined characters and the splicing step are not explained in detail, and I do not quite understand how they are actually implemented; presumably the recognition confidence is combined to choose the best chop points. I am also unsure what the criteria are for deciding between fixed pitch and proportional spacing. In the recognition part, I do not understand the claim that polygonal-approximation features mismatch as a whole yet match very well locally, with the result decided by combining the two; nor do I understand how the classifiers are trained.

Discussion is welcome.

Please give attribution when reposting.

 

Origin www.cnblogs.com/linguinost/p/11591935.html