Installing and using Tesseract on Linux (C++ API)

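Tesseract is an open-source OCR (optical character recognition) library written in C++. It is useful in machine-learning work such as NLP, where text often has to be extracted from images first. When building and training an OCR model from scratch is not worth the effort, using an open-source framework is a convenient option.

This article briefly covers installing and using Tesseract on Linux, and developing against the library's C++ API.

The Linux distribution used here is Ubuntu 14.04; other environments work similarly.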

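Download

In principle every dependency could be built and installed from source, but to avoid the assorted pitfalls of doing so, all dependencies here are installed through the Linux package manager.

Dependency list

png, jpeg, tiff: base libraries for decoding image formats
leptonica: image-processing development library
tesseract: core OCR development library
tessdata: trained data sets used for recognition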

Installation steps

step1

sudo apt-get install libpng12-dev

sudo apt-get install libjpeg8-dev

sudo apt-get install libtiff5-dev

If dependencies are still missing, also install:

sudo apt-get install zlib1g-dev

step2

sudo apt-get install libleptonica-dev

step3

sudo apt install libtesseract-dev

At this point installation is complete. The default install locations are:

Header directories:

/usr/include/leptonica
/usr/include/tesseract

Library directory:

/usr/lib contains liblept.so
/usr/lib contains libtesseract.a and libtesseract.so

step4

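After Tesseract is installed, the trained data sets are located by default in /usr/share/tesseract-ocr/tessdata.

Files there look like eng.traineddata, which is the English character set.

Trained data for other languages can be downloaded from GitHub and copied into this directory; the program finds the directory automatically at run time.

tessdata download: tessdata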
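
Before calling the library, it can help to confirm that the language file is actually in place. The following is a minimal sketch, not part of Tesseract: `hasTrainedData` is a hypothetical helper that only checks whether `<dir>/<lang>.traineddata` can be opened for reading. It says nothing about whether the file is valid or matches the installed Tesseract version.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper (not a Tesseract API): returns true if
// <tessdataDir>/<lang>.traineddata exists and can be opened for reading.
bool hasTrainedData(const std::string &tessdataDir, const std::string &lang)
{
    std::string path = tessdataDir;
    // Add a path separator only if the directory does not already end in one.
    if (!path.empty() && path[path.size() - 1] != '/') {
        path += '/';
    }
    path += lang + ".traineddata";

    std::ifstream file(path.c_str(), std::ios::binary);
    return file.good();
}
```

For the default location described above, the call would be `hasTrainedData("/usr/share/tesseract-ocr/tessdata", "eng")`.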

Using the tools

Testing with the command-line tool

Prepare a test image containing text in png, tif, jpg, or bmp format, for example phototest.png.

Run the command:

tesseract phototest.png png_res

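This generates the corresponding text file. The image can be in any of the supported formats, and the output is written as a .txt file by default (here png_res.txt).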

.
├── bmp_res.txt
├── jpg_res.txt
├── phototest.bmp
├── phototest.jpg
├── phototest.png
├── phototest.tif
├── png_res.txt
└── tif_res.txt

Recognition result

The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.
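
Running the same command for several image formats, as in the directory listing above, is easy to script. The sketch below only assembles the command strings. `buildOcrCommand` and `buildOcrCommands` are hypothetical helpers, and they assume file names without spaces or shell metacharacters, since no quoting is done.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: builds "tesseract <image> <outputBase>".
// The tesseract tool appends ".txt" to the output base by default.
std::string buildOcrCommand(const std::string &image, const std::string &outputBase)
{
    return "tesseract " + image + " " + outputBase;
}

// Builds one command per extension, e.g. phototest.png -> png_res.
std::vector<std::string> buildOcrCommands(const std::string &stem,
                                          const std::vector<std::string> &extensions)
{
    std::vector<std::string> commands;
    for (std::size_t i = 0; i < extensions.size(); ++i) {
        const std::string &ext = extensions[i];
        commands.push_back(buildOcrCommand(stem + "." + ext, ext + "_res"));
    }
    return commands;
}
```

Each resulting string can then be passed to `std::system` from `<cstdlib>`, checking the return value for failures.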

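Using the API in a program

Tesseract exposes a C++ API that makes it easy to integrate into an existing project. The simple example here follows the one on the official site.

Project structure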

.
├── build
│   ├── Makefile
│   └── ocr_cpp_test
├── CMakeLists.txt
├── phototest.png
└── src
    └── main.cpp

main.cpp

#include <cstdio>

#include <tesseract/baseapi.h>
#include <leptonica/allheaders.h>

int main()
{
    tesseract::TessBaseAPI *api = new tesseract::TessBaseAPI();
    // Initialize tesseract-ocr with English, without specifying tessdata path
    if (api->Init(NULL, "eng")) {
        fprintf(stderr, "Could not initialize tesseract.\n");
        delete api;
        return 1;
    }

    // Open input image with leptonica library (path is relative to the working directory)
    Pix *image = pixRead("../phototest.png");
    if (image == NULL) {
        fprintf(stderr, "Could not read ../phototest.png.\n");
        api->End();
        delete api;
        return 1;
    }
    api->SetImage(image);

    // Get OCR result; the caller owns the returned buffer
    char *outText = api->GetUTF8Text();
    printf("OCR output:\n%s", outText);

    // Destroy used objects and release memory
    api->End();
    delete api;
    delete [] outText;
    pixDestroy(&image);

    return 0;
}

CMake configuration file (CMakeLists.txt)

cmake_minimum_required(VERSION 2.8)
project(ocr_cpp_test)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g -std=c++11 -W")

include_directories(
    /usr/include
)

aux_source_directory(./src DIR_SRCS)

link_directories(/usr/lib)

add_executable(ocr_cpp_test ${DIR_SRCS})
target_link_libraries(ocr_cpp_test
    tesseract
    lept
)

Output after building and running (from the build directory, since main.cpp reads ../phototest.png):

OCR output:
This is a lot of 12 point text to test the
ocr code and see if it works on all types
of file format.

The quick brown dog jumped over the
lazy fox. The quick brown dog jumped
over the lazy fox. The quick brown dog
jumped over the lazy fox. The quick
brown dog jumped over the lazy fox.

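In a real project you can also place the tessdata folder in the same directory as the executable, which makes it easier to manage. In that case the data location has to be given to Init explicitly (or through the TESSDATA_PREFIX environment variable) instead of passing NULL.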
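
One way to find a tessdata folder that sits next to the executable is to resolve the executable's own directory at run time. The following is a minimal, Linux-only sketch: `executableDir` and `localTessdataDir` are hypothetical helpers, and the `<executable directory>/tessdata` layout is an assumption taken from the paragraph above.

```cpp
#include <limits.h>
#include <unistd.h>

#include <cstddef>
#include <string>

// Returns the directory containing the running executable (Linux only),
// or an empty string on failure. Relies on the /proc/self/exe symlink.
std::string executableDir()
{
    char buffer[PATH_MAX];
    ssize_t length = readlink("/proc/self/exe", buffer, sizeof(buffer) - 1);
    if (length <= 0) {
        return std::string();
    }
    // readlink does not null-terminate, so build the string from the length.
    std::string path(buffer, static_cast<std::size_t>(length));
    std::string::size_type slash = path.rfind('/');
    if (slash == std::string::npos) {
        return std::string();
    }
    if (slash == 0) {
        return "/";  // executable lives directly in the root directory
    }
    return path.substr(0, slash);
}

// Assumed layout: <executable directory>/tessdata
std::string localTessdataDir()
{
    std::string dir = executableDir();
    return dir.empty() ? dir : dir + "/tessdata";
}
```

The resulting path would be passed as the first argument of `api->Init` in place of `NULL`. Whether Init expects the tessdata folder itself or its parent directory differs between Tesseract versions, so check the documentation of the installed version before relying on either form.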