Using tesseract to read the digits in a captcha:
1. Install PIL-fork-1.1.7.win-amd64-py2.7
2. Install Pillow-4.3.0.win-amd64-py2.7
3. Install the Python OCR wrappers:
pip install pyocr
pip install pytesseract
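To confirm that steps 1 to 3 worked, a quick import check like the one below can help. It is only a sketch: find_missing_modules is a hypothetical helper name, not part of pyocr or pytesseract, and it only tests whether the modules can be imported, not whether the Tesseract program itself is installed.

```python
# coding: utf-8
import importlib


def find_missing_modules(names):
    """Return the names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


# Modules installed in steps 1-3; the PIL module is provided by Pillow.
missing = find_missing_modules(["PIL", "pyocr", "pytesseract"])
if missing:
    print("Missing modules: " + ", ".join(missing))
else:
    print("All OCR modules import correctly")
```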
4. Install the OCR tool: tesseract-ocr-setup-4.00.00dev.exe
http://digi.bib.uni-mannheim.de/tesseract/tesseract-ocr-setup-4.00.00dev.exe
Language packs (copy the downloaded .traineddata files into the tessdata folder of the installation):
Simplified Chinese: https://raw.githubusercontent.com/tesseract-ocr/tessdata/4.00/chi_sim.traineddata
Traditional Chinese: https://github.com/tesseract-ocr/tessdata/raw/4.0/chi_tra.traineddata
Other languages: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
5. Add the Tesseract-OCR install directory to the PATH environment variable:
C:\Program Files (x86)\Tesseract-OCR
6. Add a new TESSDATA_PREFIX environment variable:
Name: TESSDATA_PREFIX
Value: C:\Program Files (x86)\Tesseract-OCR\tessdata
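Steps 5 and 6 can be checked from Python before running any OCR. The following is a minimal sketch, assuming TESSDATA_PREFIX points directly at the tessdata folder as in step 6; find_on_path and list_language_packs are hypothetical helper names introduced here for illustration.

```python
# coding: utf-8
import os


def find_on_path(exe_name, path_value):
    """Return the first full path to exe_name in a PATH-style string, else None."""
    for folder in path_value.split(os.pathsep):
        folder = folder.strip('"')
        if not folder:
            continue  # skip empty PATH entries
        candidate = os.path.join(folder, exe_name)
        if os.path.isfile(candidate):
            return candidate
    return None


def list_language_packs(tessdata_dir):
    """Return sorted language codes for the *.traineddata files in tessdata_dir."""
    if not tessdata_dir or not os.path.isdir(tessdata_dir):
        return []
    suffix = ".traineddata"
    return sorted(
        name[:-len(suffix)]
        for name in os.listdir(tessdata_dir)
        if name.endswith(suffix)
    )


# Assumes Windows, as in this note; elsewhere the program is just "tesseract".
exe = "tesseract.exe" if os.name == "nt" else "tesseract"
print("tesseract found at: %s" % find_on_path(exe, os.environ.get("PATH", "")))
print("language packs: %s" % list_language_packs(os.environ.get("TESSDATA_PREFIX")))
```

If the first line prints None or the second prints an empty list, the corresponding environment variable is probably missing or points at the wrong folder. Environment variables changed in the Windows settings dialog only take effect in newly started consoles and IDEs.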
7. Python code:
# coding: utf-8
import sys

import pyocr
import pyocr.builders
import pytesseract
from PIL import Image


def image_to_str(vfile):
    """Method 1: recognize the text in an image file with pyocr."""
    tools = pyocr.get_available_tools()
    if len(tools) == 0:
        print("No OCR tool found")
        sys.exit(1)
    # Use the first available OCR tool and its first available language.
    langs = tools[0].get_available_languages()
    txt = tools[0].image_to_string(
        Image.open(vfile),
        lang=langs[0],
        builder=pyocr.builders.TextBuilder()
    )
    result = txt.replace(" ", "")  # drop spaces between recognized characters
    print(result)
    return result


def image_to_str_by_pytesseract(vfile):
    """Method 2: recognize the text in an image file with pytesseract."""
    image = Image.open(vfile)
    code = pytesseract.image_to_string(image)
    result = code.replace(" ", "")  # drop spaces between recognized characters
    print(result)
    return result


if __name__ == '__main__':
    # Backslashes are doubled so that none is read as an escape sequence.
    file1 = u'D:\\WORK\\python包&OCR\\验证码.jpg'
    file2 = u'D:\\WORK\\python包&OCR\\验证码2.jpg'
    file3 = u'D:\\WORK\\python包&OCR\\验证码3.jpg'
    image_to_str(file1)
    image_to_str_by_pytesseract(file2)
    image_to_str_by_pytesseract(file3)
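The functions above return whatever text tesseract recognizes, while the goal of this note is the digits only. One possible refinement is sketched below. The names digits_only and read_captcha_digits are hypothetical. The config string is passed through pytesseract's config argument and assumes a Tesseract 4 style command line (--psm 7 treats the image as a single text line); whether the tessedit_char_whitelist setting is honored depends on the Tesseract version and engine, so the regular expression filter is applied in any case.

```python
# coding: utf-8
import re


def digits_only(text):
    """Drop every character that is not an ASCII digit 0-9."""
    return re.sub(r"[^0-9]", "", text)


def read_captcha_digits(vfile):
    """OCR an image file and return only the digits that were recognized."""
    # Imported here so digits_only() stays usable without the OCR stack.
    import pytesseract
    from PIL import Image

    # Assumption: the captcha is one line of digits. Some Tesseract builds
    # may ignore the whitelist, hence the regex filter afterwards.
    config = "--psm 7 -c tessedit_char_whitelist=0123456789"
    raw = pytesseract.image_to_string(Image.open(vfile), config=config)
    return digits_only(raw)


print(digits_only(u" 4 8a1-7\n"))  # prints 4817
```

read_captcha_digits takes the same kind of file path as image_to_str_by_pytesseract, so it can be called with file2 or file3 from the script above.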
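8. Note for running in PyCharm: the corresponding environment variables (see steps 5 and 6) must also be added in PyCharm, otherwise the script fails when run from there.
See the attached screenshot.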
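As an alternative to editing the PyCharm settings, the same two variables can be set from inside the script before any OCR call. This is a minimal sketch, not the method shown in the screenshot: configure_tesseract_env is a hypothetical helper, and the install directory is assumed to be the one from steps 5 and 6.

```python
# coding: utf-8
import os


def configure_tesseract_env(install_dir, environ=None):
    """Add install_dir to PATH and point TESSDATA_PREFIX at its tessdata folder."""
    if environ is None:
        environ = os.environ
    folders = [p for p in environ.get("PATH", "").split(os.pathsep) if p]
    if install_dir not in folders:
        folders.insert(0, install_dir)  # searched first
    environ["PATH"] = os.pathsep.join(folders)
    environ["TESSDATA_PREFIX"] = os.path.join(install_dir, "tessdata")
    return environ


# Install location taken from steps 5 and 6; adjust it if yours differs.
configure_tesseract_env(r"C:\Program Files (x86)\Tesseract-OCR")
```

The change only affects the current Python process and the programs it starts, so the call has to come before pyocr.get_available_tools() or pytesseract.image_to_string(). For method 2, pytesseract also exposes pytesseract.pytesseract.tesseract_cmd, which can be set to the full path of tesseract.exe instead of relying on PATH.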
9. Fixing the error raised when method 2 imports pytesseract:
Go to C:\Python27\Lib\site-packages\pytesseract
Open pytesseract.py
Find:
try:
    import Image
except ImportError:
    from PIL import Image
and replace it with the single line: from PIL import Image