Image CAPTCHAs

Installation, configuration, and connection

OCR (Optical Character Recognition) is the process of scanning characters and translating them into electronic text based on their shapes.

tesserocr is an OCR library for Python. It is a Python API wrapper around tesseract, which does the actual recognition, so tesseract must be installed before tesserocr.

tesserocr GitHub: https://github.com/sirfz/tesserocr

tesserocr PyPI: https://pypi.python.org/pypi/tesserocr

tesseract downloads: http://digi.bib.uni-mannheim.de/tesseract

tesseract GitHub: https://github.com/tesseract-ocr/tesseract

tesseract language packs: https://github.com/tesseract-ocr/tessdata

tesseract documentation: https://github.com/tesseract-ocr/tesseract/wiki/Documentation

Install tesseract: sudo apt-get install -y tesseract-ocr libtesseract-dev libleptonica-dev

List the supported languages: tesseract --list-langs

Install the language packs:

git clone --depth=1 https://github.com/tesseract-ocr/tessdata.git
sudo mv tessdata/* /usr/share/tesseract-ocr/tessdata
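
To confirm from a script that a language pack was picked up, one option is to parse the output of tesseract --list-langs. The following is a minimal sketch, not part of tesseract or tesserocr: the helper names parse_langs and installed_langs are hypothetical, and it assumes the output is a header line beginning with "List of available languages" followed by one language code per line.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line (assumed format).
        if not line or line.startswith("List of available languages"):
            continue
        langs.append(line)
    return langs


def installed_langs():
    """Run the tesseract CLI and return the language codes it reports."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Assumption: if stdout is empty, the list was written to stderr instead.
    return parse_langs(result.stdout or result.stderr)
```

For example, "eng" in installed_langs() can serve as a quick check before running any recognition.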

Install tesserocr: pip3 install tesserocr pillow

Verify the installation:

tesseract image.png result -l eng && cat result.txt  # The first argument is the image file, the second (result) is the output base name (tesseract writes result.txt), and -l selects the language pack, here English (eng)
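
The same command-line check can be driven from Python. This is only a sketch: build_command and recognize_with_cli are hypothetical helper names, and it relies on the tesseract CLI accepting stdout as the output base so that the text is printed instead of being written to a .txt file.

```python
import subprocess


def build_command(image_path, lang="eng"):
    """Build the argument list for a tesseract CLI call that prints its result."""
    # Passing "stdout" as the output base asks tesseract to print the text
    # rather than write <base>.txt.
    return ["tesseract", image_path, "stdout", "-l", lang]


def recognize_with_cli(image_path, lang="eng"):
    """Run tesseract on an image file and return the recognized text."""
    result = subprocess.run(
        build_command(image_path, lang),
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()
```

Calling recognize_with_cli('image.png') should give the same text as the shell command above, provided tesseract is on the PATH.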

Recognition

A sample image CAPTCHA can be obtained from the CNKI registration page: http://my.cnki.net/elibregister/commonRegister.aspx

import tesserocr
from PIL import Image
image = Image.open('image.png')
print(tesserocr.image_to_text(image))  # image_to_text() recognizes a PIL Image object and prints the result; it usually gives better results than file_to_text()

import tesserocr
print(tesserocr.file_to_text('image.png'))
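
The string returned by either call often carries spaces, a trailing newline, or stray punctuation. Assuming the CAPTCHA contains only ASCII letters and digits (an assumption, not something the page guarantees), a small hypothetical helper such as clean_captcha_text can normalize the result before it is submitted:

```python
import re


def clean_captcha_text(raw):
    """Keep only ASCII letters and digits from an OCR result."""
    # Everything else (whitespace, newlines, punctuation) is dropped.
    return re.sub(r"[^0-9A-Za-z]", "", raw)
```

It would typically wrap the recognition call, for example clean_captcha_text(tesserocr.image_to_text(image)).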

Reducing noise

# Show the original image, then convert it to grayscale
image.show()
image = image.convert('L')

# Binarize: gray levels below the threshold map to 0 (black), the rest map to 1 (white)
threshold = 80
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)
image = image.point(table, '1')
image.show()

# A more concise way to binarize
image = image.point(lambda i: i > 160 and 255, '1')
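
Putting the steps together, grayscale conversion, binarization, and recognition can be wrapped in one function. This is a sketch under stated assumptions: build_table and recognize_captcha are hypothetical names, and the default threshold of 127 is an arbitrary starting point that will likely need tuning for each CAPTCHA style.

```python
def build_table(threshold):
    """Lookup table: gray levels below threshold map to 0, the rest to 255."""
    return [0 if level < threshold else 255 for level in range(256)]


def recognize_captcha(path, threshold=127):
    """Grayscale, binarize, and OCR the image file at `path`."""
    # Imported here so build_table() stays usable without the OCR stack.
    import tesserocr
    from PIL import Image

    image = Image.open(path).convert('L')             # grayscale
    image = image.point(build_table(threshold), '1')  # binarize
    return tesserocr.image_to_text(image).strip()
```

If the result is wrong or empty, trying a few thresholds, for instance recognize_captcha('image.png', threshold=80), is often the quickest way to find a usable value.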
