C++ 调用 Tesseract

Tesseract-ocr 是一个知名的开源的 OCR 。这里简单写写它的 C++ API 接口的使用方法。

本文主要参考了：

还有就是API 帮助文档：https://ub-mannheim.github.io/tesseract/index.html

如何编译 tesseract 这里就不多说了。在 VC 下就是 vcpkg install tesseract 一条命令。

先看一个官方的例子：

#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main()
{
    char *outText;

    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    // Initialize tesseract-ocr with English, without specifying tessdata path
    if (api->Init(NULL, "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        exit(1);
    }

    // Open input image with leptonica library
    Pix *image = pixRead("phototest.png");
    api->SetImage(image);
    // Get OCR result
    outText = api->GetUTF8Text();
    printf("OCR output:\n%s", outText);

    // Destroy used object and release memory
    api->End();
    delete api;
    delete [] outText;
    pixDestroy(&image);

    return 0;
}

api->Init(NULL, “eng”) 这句是加载 eng.traineddata ，NULL表示从默认的位置加载。当然也可以把eng.traineddata 的位置传进来。

如果我们还想同时加载其他的语言的训练数据可以这样写：api->Init(NULL, “eng+deu”)

这样就同时加载了英文和德文数据。

api->Init(NULL, “xxx”) 函数在程序中可以多次调用。每次调用后 OCR 引擎就被重新初始化。

api->SetImage(image); 这就是加载图像。之后我们还可以限制只对图像的一部分区域进行 OCR。类似下面这条语句：

api->SetRectangle(left, top, width, height) ;

api->GetUTF8Text() 获得 OCR 识别出的字符串。需要特别注意的是 GetUTF8Text() 返回的是 C 字符串，需要我们自己释放这个字符串的内存空间：

delete [] outText;

从这里也可以看出 Tesseract 比较原始，好歹应该返回个 std::string 啊。这样很容易造成内存泄漏。

一般在 OCR 之后还会看看识别的 confidence value 。

api->MeanTextConf();

这个值介于 0 到100 之间，越大说明识别正确的概率越大。

完事之后可以调用 api->End(); 来释放内存空间。

基本上这个例子就是一个最简单的用法。上面例子中用到了一个图片，我把图片放这里：

在我电脑上输出的结果如下：

1284567890 4934567890

This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.

The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.

可以看到有几个数字识别错了。如果用 SetRectangle 圈住那一串数字后再识别就全都可以识别正确。

下面再看一个高级些的例子：

  Pix *image = pixRead("phototest.png");
  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
  api->Init(NULL, "eng");
  api->SetImage(image);
  Boxa* boxes = api->GetComponentImages(tesseract::RIL_TEXTLINE, true, NULL, NULL);
  printf("Found %d textline image components.\n", boxes->n);
  for (int i = 0; i < boxes->n; i++) {
    BOX* box = boxaGetBox(boxes, i, L_CLONE);
    api->SetRectangle(box->x, box->y, box->w, box->h);
    char* ocrResult = api->GetUTF8Text();
    int conf = api->MeanTextConf();
    fprintf(stdout, "Box[%d]: x=%d, y=%d, w=%d, h=%d, confidence: %d, text: %s",
                    i, box->x, box->y, box->w, box->h, conf, ocrResult);
    boxDestroy(&box);
  }

这个例子可以将图片中的文字按行分割出来，利用的是下面这个函数：

api->GetComponentImages(tesseract::RIL_TEXTLINE, true, NULL, NULL);

RIL_TEXTLINE 表示按行分割，除此之外还可以按段落（RIL_PARA）、单词（RIL_WORD）或者字符（RIL_WORD）分割。

上面的例子运行结果如下，可以看出识别率不高。

Found 9 textline image components.
Box[0]: x=42, y=33, w=321, h=33, confidence: 40, text: 123496 /890 1234567890
Box[1]: x=36, y=92, w=544, h=30, confidence: 92, text: This Is a lot of 12 point text to test the
Box[2]: x=36, y=126, w=582, h=31, confidence: 89, text: ocr code and see If it works on ail types
Box[3]: x=36, y=160, w=187, h=24, confidence: 88, text: of tie format.
Box[4]: x=36, y=194, w=549, h=31, confidence: 90, text: The quick brown dog Jumped over the
Box[5]: x=37, y=228, w=548, h=31, confidence: 75, text: lazy Tox. 1ne quick brown dog Jumped
Box[6]: x=36, y=262, w=561, h=31, confidence: 93, text: over the lazy fox. [he quick brown dog
Box[7]: x=43, y=296, w=518, h=31, confidence: 89, text: jumped over the lazy Tox. [ne quick
Box[8]: x=37, y=330, w=524, h=31, confidence: 82, text: brown dog Jumped over the lazy Tox.

之所以识别率不高，是因为 api->SetRectangle(box->x, box->y, box->w, box->h); 这句有点问题。如果改成下面这样：

api->SetRectangle(box->x, box->y-1, box->w, box->h+1);

识别率会提升很多。这时的结果如下：

Found 9 textline image components.
Box[0]: x=42, y=33, w=321, h=33, confidence: 91, text: 1234567890 1234567890
Box[1]: x=36, y=92, w=544, h=30, confidence: 95, text: This is a lot of 12 point text to test the
Box[2]: x=36, y=126, w=582, h=31, confidence: 95, text: ocr code and see if it works on all types
Box[3]: x=36, y=160, w=187, h=24, confidence: 94, text: of file format.
Box[4]: x=36, y=194, w=549, h=31, confidence: 95, text: The quick brown dog jumped over the
Box[5]: x=37, y=228, w=548, h=31, confidence: 93, text: lazy fox. The quick brown dog jumped
Box[6]: x=36, y=262, w=561, h=31, confidence: 95, text: over the lazy fox. The quick brown dog
Box[7]: x=43, y=296, w=518, h=31, confidence: 95, text: jumped over the lazy fox. The quick
Box[8]: x=37, y=330, w=524, h=31, confidence: 93, text: brown dog jumped over the lazy fox.

上面代码另一个问题是分配的字符串没有释放空间。所以正确的代码应该改成这样：

    Boxa* boxes = api->GetComponentImages(tesseract::RIL_TEXTLINE, true, NULL, NULL);
    printf("Found %d textline image components.\n", boxes->n);
    for (int i = 0; i < boxes->n; i++) 
    {
        BOX* box = boxaGetBox(boxes, i, L_CLONE);
        api->SetRectangle(box->x, box->y-1, box->w, box->h+1);
        char* ocrResult = api->GetUTF8Text();
        int conf = api->MeanTextConf();
        fprintf(stdout, "Box[%d]: x=%d, y=%d, w=%d, h=%d, confidence: %d, text: %s",
                    i, box->x, box->y, box->w, box->h, conf, ocrResult);
        delete [] ocrResult;      
        boxDestroy(&box);
    }

最后再看一个例子：

Pix *image = pixRead("phototest.png");
  tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
  api->Init(NULL, "eng");
  api->SetImage(image);
  api->Recognize(0);
  tesseract::ResultIterator* ri = api->GetIterator();
  tesseract::PageIteratorLevel level = tesseract::RIL_WORD;
  if (ri != 0) {
    do {
      const char* word = ri->GetUTF8Text(level);
      float conf = ri->Confidence(level);
      int x1, y1, x2, y2;
      ri->BoundingBox(level, &x1, &y1, &x2, &y2);
      printf("word: '%s';  \tconf: %.2f; BoundingBox: %d,%d,%d,%d;\n",
               word, conf, x1, y1, x2, y2);
      delete[] word;
    } while (ri->Next(level));
  }

这个代码和上面的代码差不多，只不过用了 Iterator。这里不多解释了。

程序运行的结果如下：

word: '1284567890';     conf: 64.73; BoundingBox: 42,33,170,50;
word: '4934567890';     conf: 56.32; BoundingBox: 190,47,363,66;
word: 'This';   conf: 96.59; BoundingBox: 36,92,96,116;
word: 'is';     conf: 96.92; BoundingBox: 109,92,129,116;
word: 'a';      conf: 96.33; BoundingBox: 141,98,156,116;
word: 'lot';    conf: 96.33; BoundingBox: 169,92,201,116;
word: 'of';     conf: 96.45; BoundingBox: 212,92,240,116;
word: '12';     conf: 96.45; BoundingBox: 251,92,282,116;
word: 'point';          conf: 96.47; BoundingBox: 296,92,364,122;
word: 'text';   conf: 96.47; BoundingBox: 374,93,427,116;
word: 'to';     conf: 96.88; BoundingBox: 437,93,463,116;
word: 'test';   conf: 96.98; BoundingBox: 474,93,526,116;
word: 'the';    conf: 96.37; BoundingBox: 536,92,580,116;
word: 'ocr';    conf: 96.07; BoundingBox: 36,132,81,150;
word: 'code';   conf: 96.07; BoundingBox: 91,126,160,150;
word: 'and';    conf: 96.62; BoundingBox: 172,126,223,150;
word: 'see';    conf: 96.53; BoundingBox: 236,132,286,150;
word: 'if';     conf: 94.37; BoundingBox: 299,126,314,150;
word: 'it';     conf: 94.37; BoundingBox: 325,126,339,150;
word: 'works';          conf: 95.96; BoundingBox: 348,126,433,150;
word: 'on';     conf: 93.54; BoundingBox: 445,132,478,150;
word: 'all';    conf: 93.54; BoundingBox: 500,126,529,150;
word: 'types';          conf: 96.90; BoundingBox: 541,127,618,157;
word: 'of';     conf: 96.23; BoundingBox: 36,160,64,184;
word: 'file';   conf: 95.72; BoundingBox: 72,160,113,184;
word: 'format.';        conf: 95.68; BoundingBox: 123,160,223,184;
word: 'The';    conf: 96.51; BoundingBox: 36,194,91,218;
word: 'quick';          conf: 96.63; BoundingBox: 102,194,177,224;
word: 'brown';          conf: 96.82; BoundingBox: 189,194,274,218;
word: 'dog';    conf: 95.79; BoundingBox: 287,194,339,225;
word: 'jumped';         conf: 95.79; BoundingBox: 348,194,456,225;
word: 'over';   conf: 96.60; BoundingBox: 468,200,531,218;
word: 'the';    conf: 96.49; BoundingBox: 540,194,585,218;
word: 'lazy';   conf: 96.40; BoundingBox: 37,228,92,259;
word: 'fox.';   conf: 96.44; BoundingBox: 103,228,153,252;
word: 'The';    conf: 96.70; BoundingBox: 165,228,220,252;
word: 'quick';          conf: 96.63; BoundingBox: 232,228,307,258;
word: 'brown';          conf: 96.62; BoundingBox: 319,228,404,252;
word: 'dog';    conf: 95.80; BoundingBox: 417,228,468,259;
word: 'jumped';         conf: 95.80; BoundingBox: 478,228,585,259;
word: 'over';   conf: 96.29; BoundingBox: 36,268,99,286;
word: 'the';    conf: 96.28; BoundingBox: 109,262,153,286;
word: 'lazy';   conf: 96.51; BoundingBox: 165,262,221,293;
word: 'fox.';   conf: 96.30; BoundingBox: 231,262,281,286;
word: 'The';    conf: 96.65; BoundingBox: 294,262,349,286;
word: 'quick';          conf: 96.61; BoundingBox: 360,262,435,292;
word: 'brown';          conf: 96.12; BoundingBox: 447,262,532,286;
word: 'dog';    conf: 96.12; BoundingBox: 545,262,597,293;
word: 'jumped';         conf: 96.73; BoundingBox: 43,296,150,327;
word: 'over';   conf: 96.38; BoundingBox: 162,302,226,320;
word: 'the';    conf: 96.38; BoundingBox: 235,296,279,320;
word: 'lazy';   conf: 96.80; BoundingBox: 292,296,347,327;
word: 'fox.';   conf: 96.77; BoundingBox: 357,296,407,320;
word: 'The';    conf: 96.17; BoundingBox: 420,296,475,320;
word: 'quick';          conf: 96.95; BoundingBox: 486,296,561,326;
word: 'brown';          conf: 96.83; BoundingBox: 37,330,122,354;
word: 'dog';    conf: 96.32; BoundingBox: 135,330,187,361;
word: 'jumped';         conf: 96.80; BoundingBox: 196,330,304,361;
word: 'over';   conf: 96.95; BoundingBox: 316,336,379,354;
word: 'the';    conf: 96.56; BoundingBox: 388,330,433,354;
word: 'lazy';   conf: 95.99; BoundingBox: 445,330,500,361;
word: 'fox.';   conf: 96.61; BoundingBox: 511,330,561,354;

可以看出还是有识别错误的。对于这些识别错误的，可以记录下位置，稍微扩大些范围，利用 SetRectangle 重新识别。但是一定不要在 Iterator 迭代时做这个事情。因为重新识别会破坏 Iterator 的状态。

C++ 调用 Tesseract

猜你喜欢