Python: reading the digits in a CAPTCHA with Tesseract

 

Steps for reading the digits in a CAPTCHA with Tesseract:

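1. Install PIL-fork-1.1.7.win-amd64-py2.7

2. Install Pillow-4.3.0.win-amd64-py2.7

Note: Pillow is the maintained fork of PIL, and the Pillow documentation says the two cannot coexist in one environment. If Pillow installs cleanly, step 1 can probably be skipped.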

3. Install the two Python OCR wrappers with pip:

pip install pyocr

pip install pytesseract
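A quick way to confirm that steps 1 to 3 worked is to try importing each package. The sketch below is only an illustration; `check_modules` is a hypothetical helper name, not part of any of these libraries.

```python
import importlib


def check_modules(names):
    """Map each module name to True if it imports without ImportError."""
    status = {}
    for name in names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            status[name] = False
    return status


if __name__ == "__main__":
    # PIL is the import name used by Pillow.
    report = check_modules(["PIL", "pyocr", "pytesseract"])
    for name in sorted(report):
        print("%-12s %s" % (name, "ok" if report[name] else "MISSING"))
```

A package reported as MISSING needs to be installed again before continuing with the later steps.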

4. Install the OCR engine: tesseract-ocr-setup-4.00.00dev.exe

http://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-4.00.00dev.exe

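Language packs (save the downloaded .traineddata files in the tessdata folder of the Tesseract install directory):

Simplified Chinese: https://raw.githubusercontent.com/tesseract-ocr/tessdata/4.00/chi_sim.traineddata

Traditional Chinese: https://github.com/tesseract-ocr/tessdata/raw/4.0/chi_tra.traineddata

Other languages: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files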

5. Add the Tesseract-OCR install directory to the PATH environment variable:

C:\Program Files (x86)\Tesseract-OCR
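To check that the PATH change took effect, the script below searches a PATH-style string for the tesseract executable. It is a minimal sketch: `find_executable` is a hypothetical helper written for this post, and the extensions it tries (none and `.exe`) are an assumption about how the binary is named.

```python
import os


def find_executable(name, search_path=None, extensions=("", ".exe")):
    """Return the first file called `name` found in a PATH-style string, or None."""
    if search_path is None:
        search_path = os.environ.get("PATH", "")
    for directory in search_path.split(os.pathsep):
        directory = directory.strip('"')
        if not directory:
            continue
        for ext in extensions:
            candidate = os.path.join(directory, name + ext)
            if os.path.isfile(candidate):
                return candidate
    return None


if __name__ == "__main__":
    print(find_executable("tesseract") or "tesseract not found on PATH")
```

Open a new command prompt before running it, since an already open console keeps the old PATH.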

6. Add a new TESSDATA_PREFIX environment variable:

Name: TESSDATA_PREFIX

Value: C:\Program Files (x86)\Tesseract-OCR\tessdata
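The variable can be checked from Python by listing the language packs in the folder it points to. This is a sketch under the assumption that every installed language is a file named `<code>.traineddata` in that folder; `available_languages` is a hypothetical helper, not a Tesseract or pyocr function.

```python
import os


def available_languages(tessdata_dir):
    """Return the sorted language codes that have a .traineddata file in tessdata_dir."""
    suffix = ".traineddata"
    if not os.path.isdir(tessdata_dir):
        return []
    return sorted(
        name[:-len(suffix)]
        for name in os.listdir(tessdata_dir)
        if name.endswith(suffix)
    )


if __name__ == "__main__":
    prefix = os.environ.get("TESSDATA_PREFIX")
    if prefix is None:
        print("TESSDATA_PREFIX is not set")
    else:
        print("%s: %s" % (prefix, ", ".join(available_languages(prefix)) or "no language packs"))
```

An empty list usually means the variable points at the wrong folder or the language packs were saved somewhere else.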

7. Python code

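# coding: utf-8
# Two ways to OCR a CAPTCHA image: through pyocr and through pytesseract.

import sys

import pyocr
import pyocr.builders
import pytesseract
from PIL import Image


def image_to_str(vfile):
    """Method 1: recognize the text in an image file with pyocr."""
    tools = pyocr.get_available_tools()
    if len(tools) == 0:
        print("No OCR tool found")
        sys.exit(1)
    # Use the first OCR tool found and the first language it reports.
    langs = tools[0].get_available_languages()
    txt = tools[0].image_to_string(
        Image.open(vfile),
        lang=langs[0],
        builder=pyocr.builders.TextBuilder()
    )
    # Remove any spaces from the recognized text.
    result = txt.replace(" ", "")
    print(result)
    return result


def image_to_str_by_pytesseract(vfile):
    """Method 2: recognize the text in an image file with pytesseract."""
    image = Image.open(vfile)
    code = pytesseract.image_to_string(image)
    # Remove any spaces from the recognized text.
    result = code.replace(" ", "")
    print(result)
    return result


if __name__ == '__main__':
    # Backslashes are doubled so they are never read as escape sequences.
    file1 = u'D:\\WORK\\python包&OCR\\验证码.jpg'
    file2 = u'D:\\WORK\\python包&OCR\\验证码2.jpg'
    file3 = u'D:\\WORK\\python包&OCR\\验证码3.jpg'

    image_to_str(file1)
    image_to_str_by_pytesseract(file2)
    image_to_str_by_pytesseract(file3)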
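The functions in step 7 return whatever text Tesseract produces, so letters or punctuation can end up in the result. Since the goal is the digits only, one option is to filter the output afterwards. The sketch below is not the original author's method: `keep_digits` and `read_digits` are hypothetical helpers, the threshold of 140 is an arbitrary starting value, and the `--psm 7` (single text line) and `tessedit_char_whitelist` options assume a Tesseract 4 command line.

```python
import re


def keep_digits(text):
    """Drop every character except 0-9 from OCR output."""
    return re.sub(r"[^0-9]", "", text)


def read_digits(path, threshold=140):
    """OCR an image file and return only the digits found in it."""
    # Imported here so keep_digits() works even without the OCR stack installed.
    from PIL import Image
    import pytesseract

    # Convert to grayscale, then to pure black and white to suppress noise.
    gray = Image.open(path).convert("L")
    binary = gray.point(lambda p: 255 if p > threshold else 0)

    # The whitelist is only a hint: depending on the Tesseract version and
    # engine mode it may be ignored, so the result is filtered again below.
    raw = pytesseract.image_to_string(
        binary,
        config="--psm 7 -c tessedit_char_whitelist=0123456789",
    )
    return keep_digits(raw)
```

Calling `read_digits` on one of the image paths from step 7 returns a string such as the CAPTCHA code with everything that is not a digit removed. The threshold usually needs tuning per CAPTCHA style.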

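8. Note for running in PyCharm: the corresponding environment variables (those set in steps 5 and 6) must also be added to the PyCharm run configuration, otherwise the script raises an error when run from PyCharm.

See the attached screenshot.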
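An alternative to editing the run configuration is to set the same variables from inside the script before any OCR call is made. This is a minimal sketch assuming the install directory from step 5; `configure_tesseract_env` is a hypothetical helper, not part of pyocr or pytesseract.

```python
import os


def configure_tesseract_env(install_dir, env=None):
    """Point TESSDATA_PREFIX at install_dir/tessdata and put install_dir first on PATH."""
    if env is None:
        env = os.environ
    env["TESSDATA_PREFIX"] = os.path.join(install_dir, "tessdata")
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p]
    if install_dir not in parts:
        parts.insert(0, install_dir)
    env["PATH"] = os.pathsep.join(parts)
    return env


if __name__ == "__main__":
    # Assumed install location, taken from step 5.
    configure_tesseract_env(r"C:\Program Files (x86)\Tesseract-OCR")
    print(os.environ["TESSDATA_PREFIX"])
```

Call it at the top of the script, before `pyocr.get_available_tools()` or `pytesseract.image_to_string()`. For pytesseract alone, setting `pytesseract.pytesseract.tesseract_cmd` to the full path of the executable is another way to avoid depending on PATH.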

9. Fixing the error raised when method 2 imports pytesseract:

Go to C:\Python27\Lib\site-packages\pytesseract

Open pytesseract.py

Change:

try:
    import Image
except ImportError:
    from PIL import Image

to just:

from PIL import Image

 

 


Reposted from youyou888856.iteye.com/blog/2404107