http://code.google.com/p/tesseract-ocr/
https://stackoverflow.com/questions/2363490/limit-characters-tesseract-is-looking-for
http://resources.infosecinstitute.com/case-study-cracking-online-banking-captcha-login-using-python/
1. Installation:
apt-get install tesseract-ocr
2. Preprocess the image. Code snippet:
from PIL import Image

def crack(cap_name):
    # Load the captcha and convert to RGB so each pixel is an (R, G, B) tuple.
    img = Image.open(cap_name + '.JPEG')
    img = img.convert("RGB")
    pixdata = img.load()
    width, height = img.size

    # Pass 1: pixels with a low red channel become black.
    for y in range(height):
        for x in range(width):
            if pixdata[x, y][0] < 90:
                pixdata[x, y] = (0, 0, 0)

    # Pass 2: pixels with a low green channel become black.
    for y in range(height):
        for x in range(width):
            if pixdata[x, y][1] < 136:
                pixdata[x, y] = (0, 0, 0)

    # Pass 3: pixels with any blue left become white.
    for y in range(height):
        for x in range(width):
            if pixdata[x, y][2] > 0:
                pixdata[x, y] = (255, 255, 255)

    # Save the cleaned image as TIFF for tesseract.
    ext = ".tif"
    img.save(cap_name + ext)
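The three passes above can also be folded into a single per-pixel rule. The following is only a sketch of that idea, not the original author's code: the names threshold_pixel and preprocess are made up for illustration, and the thresholds 90 and 136 are simply copied from the snippet above and will likely need tuning for other captchas.

```python
def threshold_pixel(rgb):
    """Map one (R, G, B) pixel the same way the three passes in crack() do."""
    r, g, b = rgb
    if r < 90 or g < 136:
        return (0, 0, 0)        # passes 1 and 2: low red or low green -> black
    if b > 0:
        return (255, 255, 255)  # pass 3: any blue left -> white
    return (r, g, b)            # e.g. pure yellow is left unchanged


def preprocess(src_path, dst_path):
    """Apply threshold_pixel to every pixel of src_path and save to dst_path."""
    # Imported here so threshold_pixel can be used and tested without Pillow.
    from PIL import Image

    img = Image.open(src_path).convert("RGB")
    pixdata = img.load()
    width, height = img.size
    for y in range(height):
        for x in range(width):
            pixdata[x, y] = threshold_pixel(pixdata[x, y])
    img.save(dst_path)
```

Keeping the rule in one small function makes it easy to try different thresholds on a handful of sample pixels before running it over whole images.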
3. Recognize the image with the tesseract command:
tesseract imagename outbase [-l lang] [-psm N] [configfile ...]
Page segmentation modes (values for -psm N), quoted from the tesseract help:
0 = Orientation and script detection (OSD) only.
1 = Automatic page segmentation with OSD.
2 = Automatic page segmentation, but no OSD, or OCR
3 = Fully automatic page segmentation, but no OSD. (Default)
4 = Assume a single column of text of variable sizes.
5 = Assume a single uniform block of vertically aligned text.
6 = Assume a single uniform block of text.
7 = Treat the image as a single text line.
8 = Treat the image as a single word.
9 = Treat the image as a single word in a circle.
10 = Treat the image as a single character.
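For scripted use, the command line from step 3 can be assembled and run from Python. This is a minimal sketch under a few assumptions: the helper names build_tesseract_command and run_tesseract are invented here, the tesseract binary is assumed to be on PATH, and the psm_flag parameter exists because Tesseract 3.x spells the option -psm while 4.x and later spell it --psm.

```python
import subprocess


def build_tesseract_command(image, outbase, lang=None, psm=None,
                            configs=(), psm_flag="-psm"):
    """Build the argument list: tesseract imagename outbase [-l lang] [-psm N] [configfile ...]."""
    cmd = ["tesseract", image, outbase]
    if lang is not None:
        cmd += ["-l", lang]
    if psm is not None:
        # Pass psm_flag="--psm" for Tesseract 4 and later.
        cmd += [psm_flag, str(psm)]
    cmd += list(configs)
    return cmd


def run_tesseract(image, outbase, **options):
    """Run tesseract and return the recognized text from outbase.txt."""
    subprocess.run(build_tesseract_command(image, outbase, **options), check=True)
    # tesseract writes its text output to <outbase>.txt.
    with open(outbase + ".txt", encoding="utf-8") as handle:
        return handle.read().strip()
```

For a one-line captcha, something like run_tesseract("captcha.tif", "out", psm=7) would be a natural starting point, since mode 7 treats the image as a single text line; the file names here are placeholders.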
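4. Limit the characters Tesseract looks for
1) Create a new config file in the tessdata/configs folder.
2) Add the following line to the config file:
tessedit_char_whitelist abcdefghijklmnopqrstuvwxyz
3) Call the tesseract command with the new config file as the trailing configfile argument.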
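The config file from the steps above can also be generated by a script. The sketch below is illustrative only: the function names and the config name "letters" are hypothetical, and the location of tessdata/configs depends on how tesseract was installed, so the directory is passed in rather than guessed.

```python
import os


def whitelist_config_line(chars):
    """Return the config line that restricts recognition to the given characters."""
    return "tessedit_char_whitelist " + chars + "\n"


def write_whitelist_config(configs_dir, name, chars):
    """Write a whitelist config file named `name` into configs_dir and return its path."""
    path = os.path.join(configs_dir, name)
    with open(path, "w", encoding="ascii") as handle:
        handle.write(whitelist_config_line(chars))
    return path
```

With a file called letters written into tessdata/configs, the call would then look like: tesseract captcha.tif out -psm 7 letters (captcha.tif and out are placeholder names).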
5. Train Tesseract to improve recognition
Reference articles:
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract2
http://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3