Overview of the Tesseract OCR (Optical Character Recognition) Engine (Part 2)

4. Word Recognition

**Figure 2. Block Diagram of Tesseract Word Recognition**

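a. Chop joined characters. Concave points on the outline are taken as candidate chop points, and the pieces are recognized after each chop. If every chop fails, the character is assumed to be broken and incomplete, and it is repaired instead.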

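b. For broken characters, an A* search looks for the best combination of character fragments until the recognition result is satisfactory. (This works largely because the character classifier handles broken characters well.)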

Original text: Fig.2 is a block diagram of the word recognizer. In most cases, a blob corresponds to a character, so the word recognizer first classifies each blob, and presents the results to a dictionary search to find a word in the combinations of classifier choices for each blob in the word. If the word result is not good enough, the next step is to chop poorly recognized characters, where this improves the classifier confidence. After the chopping possibilities are exhausted, a best-first search of the resulting segmentation graph puts back together chopped character fragments, or parts of characters that were broken into multiple CCs in the original image. At each step in the best-first search, any new blob combinations are classified, and the classifier results are given to the dictionary again. The output for a word is the character string that had the best overall distance-based rating, after weighting according to whether the word was in a dictionary and/or had a sensible arrangement of punctuation around it. For the English version, most of these punctuation rules were hard-coded.
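
The fragment-recombination step in (b) can be sketched as a best-first search over the segmentation graph. This is a minimal illustration, not Tesseract's code: `best_first_segmentation`, the `classify` callback, and the `max_merge` limit are hypothetical names, and the search uses a zero heuristic (uniform-cost search, i.e. A* with h = 0) over non-negative ratings where lower is better. The dictionary step is left out.

```python
import heapq

def best_first_segmentation(fragments, classify, max_merge=3):
    """Find the lowest-rated way to group adjacent fragments into characters.

    fragments: fragments of one word, left to right.
    classify:  hypothetical callback; takes a tuple of adjacent fragments and
               returns (char, rating) with rating >= 0, or None if rejected.
    Returns (total_rating, text), or None if no grouping is accepted.
    """
    n = len(fragments)
    heap = [(0.0, 0, "")]   # (rating so far, next fragment index, text so far)
    best_seen = {}          # cheapest rating with which each index was expanded
    while heap:
        cost, i, text = heapq.heappop(heap)
        if i == n:
            return cost, text           # first complete path popped is the best
        if best_seen.get(i, float("inf")) <= cost:
            continue                    # index already expanded more cheaply
        best_seen[i] = cost
        # Try joining 1..max_merge adjacent fragments into one character.
        for j in range(i + 1, min(n, i + max_merge) + 1):
            result = classify(tuple(fragments[i:j]))
            if result is not None:
                char, rating = result
                heapq.heappush(heap, (cost + rating, j, text + char))
    return None

# Toy classifier: a broken "d" whose two fragments look like "c" and "l".
toy = {("c",): ("c", 2.0), ("l",): ("l", 3.0), ("c", "l"): ("d", 1.5)}
result = best_first_segmentation(["c", "l"], toy.get)
```

In the toy example the joined fragments score better (1.5) than the two separate characters (2.0 + 3.0), so the search returns "d".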

5. Shape Classification

Figure 3. (a) Prototype of h for Times Roman, (b) Match of a broken h against prototype.

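Features

a. Topological features: independent of font and size.

b. The polygonal approximation of the character as features: ineffective for broken, disconnected characters.

c. The breakthrough idea: the features used in training and the features used in recognition need not be the same. In training, the segments of the polygonal approximation serve as features. In recognition, features are extracted from the character outline and normalized, and the prototype features from the training set are then matched against them many-to-one. (Tesseract OCR uses approach c for feature extraction.)

The features extracted from the unknown: features of the character to be recognized, 3-dimensional (x, y position, angle). A character typically has 50-100 features.

The prototype features: features of the characters in the training set, 4-dimensional (x, y position, angle, length). A prototype typically has 10-20 features.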

Original text: The features are components of a polygonal approximation of the outline of a shape. In training, a 4-dimensional feature vector of (x, y-position, direction, length) is derived from each element of the polygonal approximation, and clustered to form prototypical feature vectors. (Hence the name: Tesseract.) In recognition, the elements of the polygon are broken into shorter pieces of equal length, so that the length dimension is eliminated from the feature vector. Multiple short features are matched against each prototypical feature from training, which makes the classification process more robust against broken characters.
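
The recognition-side feature extraction described above can be sketched as follows. This is an illustration under assumptions, not Tesseract's implementation: `short_features` and its `step` parameter are hypothetical, each piece is represented by its midpoint, and pieces on one edge are only approximately `step` long because the edge is divided into a whole number of pieces.

```python
import math

def short_features(polygon, step):
    """Break each edge of a closed polygon into short pieces of roughly
    `step` length and return one (x, y, direction) feature per piece.

    polygon: list of (x, y) vertices in order; the last vertex connects
             back to the first.
    """
    features = []
    n = len(polygon)
    for i in range(n):
        (x0, y0), (x1, y1) = polygon[i], polygon[(i + 1) % n]
        length = math.hypot(x1 - x0, y1 - y0)
        if length == 0:
            continue                              # skip degenerate edges
        direction = math.atan2(y1 - y0, x1 - x0)  # angle of the edge in radians
        pieces = max(1, round(length / step))
        for k in range(pieces):
            t = (k + 0.5) / pieces                # midpoint of the k-th piece
            features.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0), direction))
    return features

# A unit square with step 0.25 gives 4 pieces per edge.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
square_features = short_features(square, 0.25)
```

Because every piece has about the same length, the length dimension carries no information and the feature shrinks from four dimensions to three, which matches the 3-dimensional features listed for the unknown character.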

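Classification proceeds in two steps:

a. Coarse classification (class pruner): using the many features of the unknown, a method similar to LSH (locality-sensitive hashing) lists the characters that lie close to each feature, producing a short-list of 1-10 characters.

b. Fine classification (computes the distance): the short-listed characters are then separated by feature distance.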

Original text: The shape classifier operates in two stages. The first stage, called the class pruner, reduces the character set to a short-list of 1-10 characters, using a method closely related to Locality Sensitive Hashing (LSH) [13]. The final stage computes the distance of the unknown from the prototypes of the characters in the short-list.
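
The two stages can be sketched roughly as below. This is only loosely in the spirit of the class pruner and is not how Tesseract implements it: the bucketing scheme (`cell`, `n_dirs`), the voting rule, and all function names are assumptions made for illustration, and the distance is a plain squared difference that ignores the prototype's length dimension.

```python
import math
from collections import Counter, defaultdict

def bucket(feature, cell=0.25, n_dirs=8):
    """Quantize an (x, y, direction) feature into a coarse hash bucket."""
    x, y, theta = feature
    d = int((theta % (2 * math.pi)) / (2 * math.pi) * n_dirs) % n_dirs
    return (math.floor(x / cell), math.floor(y / cell), d)

def build_pruner(prototypes):
    """Map each bucket to the set of classes that own a prototype in it.
    prototypes: {label: [(x, y, direction, length), ...]}"""
    table = defaultdict(set)
    for label, protos in prototypes.items():
        for x, y, theta, _length in protos:
            table[bucket((x, y, theta))].add(label)
    return table

def prune(table, features, max_classes=10):
    """Stage 1: each feature votes for the classes in its bucket."""
    votes = Counter()
    for f in features:
        for label in table.get(bucket(f), ()):
            votes[label] += 1
    return [label for label, _ in votes.most_common(max_classes)]

def feature_distance(f, p):
    """Squared distance between a short feature and a prototype feature."""
    dtheta = abs(f[2] - p[2]) % (2 * math.pi)
    dtheta = min(dtheta, 2 * math.pi - dtheta)   # wrap the angle difference
    return (f[0] - p[0]) ** 2 + (f[1] - p[1]) ** 2 + dtheta ** 2

def classify(prototypes, table, features):
    """Stage 2: among short-listed classes, pick the smallest mean distance,
    matching every short feature to its nearest prototype (many-to-one)."""
    shortlist = prune(table, features)
    if not shortlist:
        return None
    def mean_distance(label):
        return sum(min(feature_distance(f, p) for p in prototypes[label])
                   for f in features) / len(features)
    return min(shortlist, key=mean_distance)

# Toy data: a vertical stroke "l" and a horizontal stroke "-".
protos = {"l": [(0.6, 0.6, math.pi / 2, 1.0)], "-": [(0.6, 0.6, 0.0, 1.0)]}
table = build_pruner(protos)
unknown = [(0.6, 0.55, math.pi / 2), (0.6, 0.7, math.pi / 2)]
label = classify(protos, table, unknown)
```

The point of the split is cost: the cheap table lookup discards most classes, so the more expensive distance computation only runs on the short-list.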

6. Segmentation and Search

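The training data contains the most frequent words, common dictionary words, common numbers, and common uppercase and lowercase words.

Each segmented word awaiting recognition is compared against these words, and the result is chosen by weighted minimum distance.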
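
A weighted minimum-distance choice could look like the sketch below. The category names follow the list above, but the weight values and the `best_word` helper are hypothetical; the source does not give the actual constants.

```python
# Hypothetical per-category weights; smaller means the category is trusted more.
CATEGORY_WEIGHTS = {
    "frequent": 1.0,
    "dictionary": 1.1,
    "numeric": 1.2,
    "upper": 1.3,
    "lower": 1.3,
}

def best_word(candidates):
    """Pick the candidate with the lowest weighted distance.

    candidates: list of (word, category, total_distance) tuples.
    """
    return min(candidates, key=lambda c: c[2] * CATEGORY_WEIGHTS[c[1]])[0]

# "loo" has the smaller raw distance, but the numeric reading wins after weighting.
choice = best_word([("100", "numeric", 8.0), ("loo", "lower", 7.5)])
```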

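Problem: different segmentations lead to different recognition results. Both results can be plausible, because the segmentation itself is uncertain. Two metrics quantify this. The first is the confidence: the negative of the normalized distance from the unknown character to the prototype (the larger the confidence, the better the recognition). The second is the rating: the normalized distance from the unknown character to the prototype multiplied by the outline length of the unknown character.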
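
The two metrics are simple enough to write out directly. In this sketch the function names and the numbers are made up for illustration. Summing the ratings over a word and preferring the lower total is an assumption about how the two segmentations are compared; it relies on the total outline length of a word being the same however the word is segmented.

```python
def confidence(normalized_distance):
    """Negative normalized distance: larger (closer to zero) is better."""
    return -normalized_distance

def rating(normalized_distance, outline_length):
    """Normalized distance scaled by the outline length of the character."""
    return normalized_distance * outline_length

def word_rating(chars):
    """Sum of per-character ratings; chars is a list of
    (normalized_distance, outline_length) pairs."""
    return sum(rating(d, length) for d, length in chars)

# Two segmentations of the same ink, total outline length 30 in both cases.
as_m = word_rating([(0.5, 30.0)])                  # one blob read as "m"
as_rn = word_rating([(0.25, 12.0), (0.25, 18.0)])  # two blobs read as "rn"
```

Here `as_m` is 15.0 and `as_rn` is 7.5, so under the lower-is-better assumption the "rn" segmentation would be preferred, even though the two readings have different numbers of characters.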

7. Adaptive Classifier

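Because the static classifier has to cover many fonts, its ability to discriminate between similar characters, and between characters and non-characters, is weakened. Since the number of fonts within a single document is limited, the output of the static classifier can be used to train a more font-sensitive adaptive classifier, which improves discrimination.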

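Tesseract does not use a template classifier; the adaptive classifier uses the same features and classifier as the static classifier. Apart from the training data, the difference between the two is that the adaptive classifier normalizes characters by the baseline and x-height of the text line (the x-height is the height of a lowercase x). After this normalization, uppercase letters, lowercase letters, and noise are easy to tell apart. The static classifier instead normalizes each character on its own, using the first moments to fix position and the second moments to fix size.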

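The main benefit of normalizing by character moments is that it removes the influence of aspect ratio and font stroke width, and it makes superscripts and subscripts simpler to tell apart. The cost is that an additional classifier feature is needed to distinguish uppercase from lowercase letters. (The two normalizations: baseline/x-height normalization, and moment normalization of the individual character.)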
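
The two normalizations can be contrasted with a short sketch. Both functions are illustrative assumptions rather than Tesseract's code: the moments are computed over a plain list of outline points, and horizontal centering on the mean x in `baseline_normalize` is a choice made here for convenience.

```python
import math

def moment_normalize(points):
    """Static-classifier style: center on the centroid (first moments) and
    scale each axis by its own spread (second moments), so the aspect
    ratio of the character is removed."""
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    sx = math.sqrt(sum((x - cx) ** 2 for x, _ in points) / n) or 1.0
    sy = math.sqrt(sum((y - cy) ** 2 for _, y in points) / n) or 1.0
    return [((x - cx) / sx, (y - cy) / sy) for x, y in points]

def baseline_normalize(points, baseline, x_height):
    """Adaptive-classifier style: measure height from the text-line baseline
    and scale both axes by the same x-height, so relative size and vertical
    position are preserved."""
    cx = sum(x for x, _ in points) / len(points)
    return [((x - cx) / x_height, (y - baseline) / x_height) for x, y in points]

# A tall, narrow box: 2 units wide, 8 units high, sitting on the baseline.
tall = [(0.0, 0.0), (2.0, 0.0), (2.0, 8.0), (0.0, 8.0)]
by_moments = moment_normalize(tall)
by_baseline = baseline_normalize(tall, baseline=0.0, x_height=4.0)
```

After moment normalization the box becomes a square, so the information that it was tall is gone. After baseline/x-height normalization it still reaches two x-heights above the baseline, which is the kind of cue that separates uppercase letters and ascenders from lowercase letters and small noise specks.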