OCR Text Recognition

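OCR (Optical Character Recognition) is the process of analyzing the text in an image file and extracting it. Tesseract is an open-source OCR engine. It was originally developed at HP Labs and later released as open source; Google then improved it, fixed bugs, optimized it, and re-released it.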

http://code.google.com/p/tesseract-ocr/

Summary: Tesseract is probably the most accurate open source OCR engine available. Combined with the Leptonica Image Processing Library it can read a wide variety of image formats and convert them to text in over 60 languages. It was one of the top 3 engines in the 1995 UNLV Accuracy test. Between 1995 and 2006 it had little work done on it, but since then it has been improved extensively by Google. It is released under the Apache License 2.0.

Supported Platforms: Tesseract works on Linux, Windows (with VC++ Express or CygWin) and Mac OSX. See the ReadMe for more details and install instructions. It can also be compiled for other platforms, including Android and the iPhone, though these are not as well tested platforms. See also the AddOns page for other projects using Tesseract on various platforms.

----------------------------------------------------------------------------------------------

1. Installing tesseract on Linux: http://code.google.com/p/tesseract-ocr/wiki/Compiling

-----
# install dependencies
sudo apt-get install autoconf automake libtool
sudo apt-get install libpng12-dev
sudo apt-get install libjpeg62-dev
sudo apt-get install libtiff4-dev
sudo apt-get install zlib1g-dev
sudo apt-get install libleptonica-dev
------
sudo apt-get install g++ 
#g++ --version
------
# build and install tesseract (run from the tesseract source directory)
./autogen.sh
./configure
make
sudo make install
sudo ldconfig
------
# install language data
# run from the directory where the language data was extracted (add sudo if tessdata is not writable)
cp eng.traineddata /usr/local/share/tessdata
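
The install steps above end with ldconfig and with copying the language data into place. As a rough sanity check after installing, something like the following could be used. This is only a sketch: the function names `lib_in_cache` and `has_traineddata` are made up for illustration, and the default path assumes the /usr/local prefix used above.

```shell
# Post-install sanity checks (a sketch, not part of tesseract itself).

# Succeeds if the linker cache listing read from stdin mentions the
# given library name. Typical use: ldconfig -p | lib_in_cache libtesseract
lib_in_cache() {
    grep -q -- "$1"
}

# Succeeds if <lang>.traineddata exists in the tessdata directory.
# $1: language code, e.g. eng
# $2: tessdata directory (optional, defaults to /usr/local/share/tessdata)
has_traineddata() {
    [ -f "${2:-/usr/local/share/tessdata}/$1.traineddata" ]
}
```

For example, `ldconfig -p | lib_in_cache libtesseract || sudo ldconfig` refreshes the cache only when the library is not listed, and `has_traineddata eng || echo "eng.traineddata is missing"` reports a missing language file.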

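2. Testing. The overall recognition rate is not very high. The first kind of image (digits only) is recognized well; for the second kind of CAPTCHA, the '-psm 6' option gives a higher recognition rate.

1)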

➜  Downloads  tesseract test.png aa         
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
➜  Downloads  more aa.txt 
0376

➜  Downloads  tesseract test1.jpg 1 
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
Empty page!!
Empty page!!
➜  Downloads  more 1.txt

➜  Downloads  tesseract test1.jpg 1 -psm 7    
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
➜  Downloads  more 1.txt
EMsi~\

➜  Downloads  tesseract test7.jpg 7 -psm 6
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
➜  Downloads  more 7.txt 
9u2E

➜  Downloads  tesseract test2.jpg 2
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
➜  Downloads  more 2.txt 
F KASKN
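
The transcripts above try different -psm values by hand. A small loop can make that comparison less tedious. The sketch below is an assumption-laden helper, not something from the tesseract distribution: `try_psm_modes` and the `TESSERACT_BIN` override are hypothetical names, and the output files are written to the current directory as psm_<mode>.txt. It uses the -psm spelling of the 3.02 release shown above; Tesseract 4.x and later spell the option --psm.

```shell
# Run tesseract on one image with several page segmentation modes
# and print what each mode recognized.
# $1: image file; remaining arguments: page segmentation modes to try.
# TESSERACT_BIN (hypothetical) overrides the binary name if set.
try_psm_modes() {
    image=$1
    shift
    for mode in "$@"; do
        out="psm_$mode"
        # Remove output left over from an earlier run so it is not reported again.
        rm -f "$out.txt"
        "${TESSERACT_BIN:-tesseract}" "$image" "$out" -psm "$mode" >/dev/null 2>&1 || true
        printf 'psm %s: %s\n' "$mode" "$(cat "$out.txt" 2>/dev/null)"
    done
}
```

Calling `try_psm_modes test1.jpg 3 6 7` prints one line per mode, so the mode that works best for a given kind of CAPTCHA is easier to spot.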

3. Additional notes

1) If ldconfig is not run after installing tesseract, running tesseract fails with: error while loading shared libraries: xxx.so.x

For the cause, see: http://hi.baidu.com/longquan302/item/3e3a82102f77565c7b5f251b

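2) tesseract language data downloads: http://code.google.com/p/tesseract-ocr/downloads/list

3) Third-party tools built on tesseract-ocr: http://code.google.com/p/tesseract-ocr/wiki/3rdParty

4) Installation instructions in Chinese: http://www.linuxidc.com/Linux/2011-07/38728.htm

5) tesseract usage

Usage: tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
Example: tesseract code.jpg result -l chi_sim -psm 7 nobatch
-l chi_sim selects the Simplified Chinese language data. The language data file has to be downloaded, extracted, and placed in the tessdata directory. Language data files have the extension .traineddata; the Simplified Chinese file is named chi_sim.traineddata.
-psm 7 tells tesseract that the image code.jpg contains a single line of text, which can reduce the recognition error rate. The default is 3.
configfile is the name of a file in the tessdata\configs or tessdata\tessconfigs directory.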

6) Calling tesseract-ocr from Java: http://blog.sina.com.cn/s/blog_025270e90101avgb.html

7) Using tesseract-ocr on Windows: http://blog.csdn.net/xiaochunyong/article/details/7193744

8) To recognize digits only: tesseract imagename outputbase digits
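
To make the Usage line in item 5 and the digits config in item 8 concrete, here is a minimal sketch of a helper that assembles a command line in that order. `tesseract_cmd` is a hypothetical name, not part of tesseract. It only prints the command so it can be inspected before running, it expects at least four arguments, and it does not handle file names containing spaces.

```shell
# Print a tesseract command line in the order given by the Usage string:
#   tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfile...]
# $1: image, $2: output base, $3: language (may be empty),
# $4: page segmentation mode (may be empty), rest: config file names.
tesseract_cmd() {
    image=$1
    outputbase=$2
    lang=$3
    psm=$4
    shift 4
    cmd="tesseract $image $outputbase"
    if [ -n "$lang" ]; then
        cmd="$cmd -l $lang"
    fi
    if [ -n "$psm" ]; then
        cmd="$cmd -psm $psm"
    fi
    # Config file names such as nobatch or digits go last.
    for config in "$@"; do
        cmd="$cmd $config"
    done
    printf '%s\n' "$cmd"
}
```

For instance, `tesseract_cmd code.jpg result chi_sim 7 nobatch` prints the example command from item 5, and `tesseract_cmd test.png out "" "" digits` prints a digits-only command as in item 8.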
