python: using tesseract to read the digits in a CAPTCHA

Using tesseract to read the digits in a CAPTCHA image:

1. Install PIL-fork-1.1.7.win-amd64-py2.7

2. Install Pillow-4.3.0.win-amd64-py2.7

3. pip install pyocr

pip install pytesseract
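As a quick check that steps 1 to 3 worked, the sketch below tries to import each package and reports the ones that are missing. The helper name `missing_modules` is made up for this illustration; `PIL` is the import name that Pillow provides.

```python
import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # An empty list means all three packages can be imported.
    print(missing_modules(["PIL", "pyocr", "pytesseract"]))
```

If the printed list is not empty, reinstall the packages it names before moving on.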

4. Install the OCR tool: tesseract-ocr-setup-4.00.00dev.exe

http://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-4.00.00dev.exe

Language packs:

Simplified Chinese: https://raw.githubusercontent.com/tesseract-ocr/tessdata/4.00/chi_sim.traineddata

Traditional Chinese: https://github.com/tesseract-ocr/tessdata/raw/4.0/chi_tra.traineddata

Data files for other languages: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files

5. Add the tesseract-ocr install directory to the PATH environment variable

C:\Program Files (x86)\Tesseract-OCR
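To confirm that the directory from step 5 really ended up on PATH, something like the following can be run from a new console. It is only a sketch: `find_on_path` is a hypothetical helper, and `tesseract.exe` is assumed to be the name of the binary the installer puts in that directory.

```python
import os


def find_on_path(exe_name, path_value=None):
    """Return the full path of `exe_name` found in a PATH-style string, or None."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    for directory in path_value.split(os.pathsep):
        if not directory:
            continue
        # PATH entries are sometimes wrapped in double quotes on Windows.
        candidate = os.path.join(directory.strip('"'), exe_name)
        if os.path.isfile(candidate):
            return candidate
    return None


if __name__ == "__main__":
    print(find_on_path("tesseract.exe") or "tesseract.exe is not on PATH")
```

Environment variable changes are only picked up by consoles and IDEs started after the change, so reopen them before testing.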

6. Add a new TESSDATA_PREFIX environment variable

Variable name: TESSDATA_PREFIX

Value: C:\Program Files (x86)\Tesseract-OCR\tessdata
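A small sketch for checking step 6: it lists the language codes that have a `.traineddata` file in the directory TESSDATA_PREFIX points to. It assumes the language packs downloaded in step 4 were copied into that tessdata directory, which the steps above do not state explicitly; `installed_languages` is a name invented for the example.

```python
import os


def installed_languages(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file in `tessdata_dir`."""
    suffix = ".traineddata"
    if not os.path.isdir(tessdata_dir):
        return []
    return sorted(
        name[:-len(suffix)]
        for name in os.listdir(tessdata_dir)
        if name.endswith(suffix)
    )


if __name__ == "__main__":
    # An empty list means the variable is unset or points to the wrong directory.
    print(installed_languages(os.environ.get("TESSDATA_PREFIX", "")))
```

If `chi_sim` or `chi_tra` is expected but missing from the output, the corresponding file is probably in a different directory.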

7. Python code

# coding: utf-8
import sys

import pyocr
import pyocr.builders
import pytesseract
from PIL import Image


def image_to_str(vfile):
    """Method 1: recognize the text in an image with pyocr."""
    tools = pyocr.get_available_tools()
    if len(tools) == 0:
        print("No OCR tool found")
        sys.exit(1)
    tool = tools[0]
    # Use the first language the OCR tool reports as installed.
    langs = tool.get_available_languages()
    txt = tool.image_to_string(
        Image.open(vfile),
        lang=langs[0],
        builder=pyocr.builders.TextBuilder()
    )
    result = txt.replace(" ", "")
    print(result)
    return result


def image_to_str_by_pytesseract(vfile):
    """Method 2: recognize the text in an image with pytesseract."""
    image = Image.open(vfile)
    code = pytesseract.image_to_string(image)
    result = code.replace(" ", "")
    print(result)
    return result


if __name__ == '__main__':
    # Backslashes are doubled so that none of them is read as an escape sequence.
    file1 = u'D:\\WORK\\python包&OCR\\验证码.jpg'
    file2 = u'D:\\WORK\\python包&OCR\\验证码2.jpg'
    file3 = u'D:\\WORK\\python包&OCR\\验证码3.jpg'
    image_to_str(file1)
    image_to_str_by_pytesseract(file2)
    image_to_str_by_pytesseract(file3)
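Since the goal is the digits of the CAPTCHA, the recognized text may still contain stray letters, punctuation or line breaks. The sketch below is one possible post-processing step, not part of the original code: `keep_digits` and `read_digits` are hypothetical names, and the only library call used is `pytesseract.image_to_string`, the same one as in method 2.

```python
import re


def keep_digits(text):
    """Drop every character that is not an ASCII digit 0-9."""
    return re.sub(r"[^0-9]", "", text)


def read_digits(image_path):
    """OCR an image with pytesseract and return only the digits it found."""
    # Imported lazily so keep_digits() works without the OCR stack installed.
    import pytesseract
    from PIL import Image

    return keep_digits(pytesseract.image_to_string(Image.open(image_path)))
```

Filtering after recognition does not correct misread characters (for example a 0 read as the letter O is simply dropped), so the result is worth checking against the expected CAPTCHA length.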

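8. Note for running in PyCharm: the corresponding environment variables have to be added in PyCharm as well, otherwise the script fails when it is run from PyCharm

See the attachment for a screenshot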
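A possible alternative to configuring the variables in PyCharm is to set them from the script itself, before pyocr or pytesseract are used. This is a sketch under that assumption and was not part of the original setup; `add_tesseract_to_env` is a hypothetical helper, and the install directory is the one from step 5.

```python
import os


def add_tesseract_to_env(install_dir, environ=None):
    """Prepend `install_dir` to PATH and point TESSDATA_PREFIX at its tessdata folder."""
    if environ is None:
        environ = os.environ
    paths = [p for p in environ.get("PATH", "").split(os.pathsep) if p]
    if install_dir not in paths:
        paths.insert(0, install_dir)
    environ["PATH"] = os.pathsep.join(paths)
    environ["TESSDATA_PREFIX"] = os.path.join(install_dir, "tessdata")
    return environ


if __name__ == "__main__":
    # Install directory taken from step 5; adjust it if Tesseract lives elsewhere.
    add_tesseract_to_env(r"C:\Program Files (x86)\Tesseract-OCR")
```

The change only affects the current Python process and the programs it starts, so the system-wide settings from steps 5 and 6 stay untouched.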

9. Fixing the error raised when method 2 imports pytesseract:

Go to C:\Python27\Lib\site-packages\pytesseract

Open pytesseract.py

Change:

try:
    import Image
except ImportError:
    from PIL import Image

to just: from PIL import Image
