[How-to] Notes on Using Tesseract-OCR

Google's Tesseract is one of the most capable open-source projects in the OCR field today. Here I describe how to use it on Windows:

Official repository: https://github.com/tesseract-ocr/tesseract

Documentation: https://github.com/tesseract-ocr/tesseract/wiki

Parameter reference: https://github.com/tesseract-ocr/tesseract/wiki/ControlParams

Data files: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

Windows download: https://github.com/UB-Mannheim/tesseract/wiki . Download the latest version if you can; at the time of writing it is 4.0.0.

If you want to use the tesseract command from any directory, add the Tesseract installation folder to your PATH environment variable.
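One way to confirm that the PATH change took effect is to ask Python where it finds the executable. This is only a sketch: it assumes Python is installed, and the helper names find_tesseract and describe are made up for illustration.

```python
import shutil


def find_tesseract(executable="tesseract"):
    """Return the full path of the executable found on PATH, or None."""
    # shutil.which searches the directories listed in PATH; on Windows it
    # also tries the extensions listed in PATHEXT (such as .exe).
    return shutil.which(executable)


def describe(path):
    """Turn the lookup result into a short human-readable message."""
    if path is None:
        return "tesseract is not on PATH; add its install folder and reopen the terminal"
    return "tesseract found at " + path
```

Calling print(describe(find_tesseract())) in a newly opened terminal shows whether the command is visible. A terminal that was already open before the change will usually still see the old PATH.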

I. Quick start

Open a command prompt and enter:

tesseract imagename outputbase

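Place an image in your working directory first. Here imagename is the image file name (including its extension) and outputbase is the base name of the output file (choose anything you like). The command creates a new text file named outputbase.txt in that same directory.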
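The same call can be scripted. The following is a minimal sketch, assuming Python is installed and tesseract is on PATH; the function names build_command, output_path and run_ocr are hypothetical helpers, not part of Tesseract.

```python
import subprocess


def build_command(imagename, outputbase):
    """Build the argument list for: tesseract imagename outputbase."""
    return ["tesseract", imagename, outputbase]


def output_path(outputbase):
    """Tesseract appends .txt to the output base for plain-text output."""
    return outputbase + ".txt"


def run_ocr(imagename, outputbase):
    """Run tesseract, then return the recognized text from the output file."""
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_command(imagename, outputbase), check=True)
    with open(output_path(outputbase), encoding="utf-8") as handle:
        return handle.read()
```

For example, run_ocr("test.png", "result") would run the command on a file named test.png (an example name) and return the contents of result.txt.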

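The command above recognizes English by default. To recognize Chinese, first download the Chinese language data (chi_sim.traineddata for Simplified Chinese) from the data files link above and copy it into the tessdata folder inside the Tesseract installation folder. Then run: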

tesseract imagename outputbase -l chi_sim

and Chinese text will be recognized.
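If the -l option complains that a language cannot be loaded, the usual cause is a missing .traineddata file. A small sketch for checking this follows; the helper names are hypothetical, and the tessdata folder location is something you have to supply yourself because it depends on where Tesseract was installed.

```python
import os


def traineddata_path(tessdata_dir, lang):
    """Path where Tesseract expects the data file for a language code."""
    return os.path.join(tessdata_dir, lang + ".traineddata")


def missing_languages(tessdata_dir, langs):
    """Return the language codes whose .traineddata file is not present."""
    return [
        lang
        for lang in langs
        if not os.path.isfile(traineddata_path(tessdata_dir, lang))
    ]
```

Calling missing_languages(your_tessdata_folder, ["eng", "chi_sim"]) returns an empty list when both files are in place.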

II. Improving recognition

1. First, you can add the page segmentation mode parameter --psm. The command is:

tesseract imagename outputbase -l chi_sim --psm 6

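The value 6 can be replaced with any mode from 1 to 13 (mode 0 only performs orientation and script detection, without OCR). Try them yourself to see which one works best for your images. The official descriptions are: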

  0    Orientation and script detection (OSD) only.
  1    Automatic page segmentation with OSD.
  2    Automatic page segmentation, but no OSD, or OCR.
  3    Fully automatic page segmentation, but no OSD. (Default)
  4    Assume a single column of text of variable sizes.
  5    Assume a single uniform block of vertically aligned text.
  6    Assume a single uniform block of text.
  7    Treat the image as a single text line.
  8    Treat the image as a single word.
  9    Treat the image as a single word in a circle.
 10    Treat the image as a single character.
 11    Sparse text. Find as much text as possible in no particular order.
 12    Sparse text with OSD.
 13    Raw line. Treat the image as a single text line,
       bypassing hacks that are Tesseract-specific.
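Trying the modes by hand gets tedious, so the comparison can be scripted. This is a rough sketch rather than a recommended workflow: psm_commands is a hypothetical helper, and the _psmN suffix on the output name is a convention invented here so that the results do not overwrite each other.

```python
def psm_commands(imagename, outputbase, lang="chi_sim", modes=range(1, 14)):
    """Build one tesseract command per page segmentation mode."""
    commands = []
    for mode in modes:
        # Each mode writes to its own output base, e.g. out_psm6 -> out_psm6.txt.
        mode_output = "{}_psm{}".format(outputbase, mode)
        commands.append(
            ["tesseract", imagename, mode_output, "-l", lang, "--psm", str(mode)]
        )
    return commands
```

Each command can then be passed to subprocess.run(command, check=False). Some modes may fail on a particular image or installation, so errors are not treated as fatal here, and the resulting text files can be compared side by side.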

2. Improve the image itself

Typical preprocessing steps include binarization, among others. See the official guide: https://github.com/tesseract-ocr/tesseract/wiki/ImproveQuality
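To make the idea of binarization concrete, here is a dependency-free sketch that picks a threshold with Otsu's method and maps every pixel to black or white. Otsu's method is just one common choice, used here for illustration. The input is assumed to be a flat list of 0-255 grayscale values; in practice an image library would load the pixels and do this step for you.

```python
def otsu_threshold(pixels):
    """Return the Otsu threshold for a sequence of 0-255 grayscale values."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    total = sum(histogram)
    total_sum = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    background_count, background_sum = 0, 0
    for level in range(256):
        # Background = all pixels with a value <= level.
        background_count += histogram[level]
        background_sum += level * histogram[level]
        foreground_count = total - background_count
        if background_count == 0 or foreground_count == 0:
            continue
        mean_background = background_sum / background_count
        mean_foreground = (total_sum - background_sum) / foreground_count
        # Between-class variance (up to a constant factor); maximize it.
        variance = (
            background_count
            * foreground_count
            * (mean_background - mean_foreground) ** 2
        )
        if variance > best_variance:
            best_variance = variance
            best_threshold = level
    return best_threshold


def binarize(pixels, threshold):
    """Map values above the threshold to 255 (white), the rest to 0 (black)."""
    return [255 if value > threshold else 0 for value in pixels]
```

For a scan with dark text on a light background, binarize(pixels, otsu_threshold(pixels)) leaves the text black and the paper white.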

III. Training a model

1. The traditional approach

See other people's blog posts for the details. First install jTessBoxEditorFX, then merge your images into a single tif file, and run:

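rem Generate the box file from the merged tif using the existing chi_sim model
tesseract chi_my.font.exp0.tif chi_my.font.exp0 -l chi_sim batch.nochop makebox
rem Describe the font: name, then the italic bold fixed serif fraktur flags
rem (redirection comes first so cmd does not read the last 0 as a handle number)
>font_properties echo font 0 0 0 0 0
rem Produce the .tr training file from the tif and its box file
tesseract chi_my.font.exp0.tif chi_my.font.exp0 nobatch box.train
rem Extract the character set, then run the two training steps
unicharset_extractor chi_my.font.exp0.box
mftraining -F font_properties -U unicharset -O chi_my.unicharset chi_my.font.exp0.tr
cntraining chi_my.font.exp0.tr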
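The file names in these commands follow the pattern [lang].[fontname].exp[num], so they can be generated instead of typed. The sketch below rebuilds the commands shown above for an arbitrary language and font name. The function names are hypothetical, and nothing here goes beyond the steps already listed.

```python
def training_commands(lang="chi_my", font="font", exp=0, base_lang="chi_sim"):
    """Rebuild the traditional training commands as argument lists."""
    base = "{}.{}.exp{}".format(lang, font, exp)
    tif, box, tr = base + ".tif", base + ".box", base + ".tr"
    return [
        ["tesseract", tif, base, "-l", base_lang, "batch.nochop", "makebox"],
        ["tesseract", tif, base, "nobatch", "box.train"],
        ["unicharset_extractor", box],
        ["mftraining", "-F", "font_properties", "-U", "unicharset",
         "-O", lang + ".unicharset", tr],
        ["cntraining", tr],
    ]


def font_properties_line(font="font"):
    """One font_properties entry: name, then italic bold fixed serif fraktur."""
    return font + " 0 0 0 0 0"
```

The list is not meant to be executed in one go: the box file produced by the first command is normally reviewed and corrected in jTessBoxEditorFX before the remaining commands are run.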

2. The newer approach

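The traditional approach is how training was done in the Tesseract 3.0 era. The latest 4.0 release uses an LSTM model. Previously you could only train one character at a time, which was painfully tedious; with 4.0 you can train directly on whole lines of text.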

...To be continued when I have time.