字符型图片验证码的识别

概述

需求

访问地址：http://shixin.csrc.gov.cn/honestypub
每日请求量：万级
请求成功率要求：80%以上
请求反馈速度：5s以内

方案设计

操作系统：CentOS7
编程语言: python3（flask）
爬虫：selenium
图片识别：tesseract-OCR,PILLOW,jTessBoxEditor
关键技术点：验证码识别
baseline
tesseract-OCR（lang:eng）:准确率200/376 = 53.19%
改进方案
（1）图片点降噪；（2）基于tesseract训练工具的样本训练
改进结果
训练样本30——准确率：302/392=77.04%；训练样本90——准确率：300/352=85.23%

python相关配置

请求服务相应包的安装

Flask
pip install Flask
request库
pip install requests

selenium库配置

安装selenium库
pip install selenium
安装firefox驱动
https://github.com/mozilla/geckodriver/releases
解压geckodriver文件到/usr/sbin/目录下（注意与firefox版本匹配，更新firefox命令:yum install firefox）

图片验证码识别包安装

pip install pytesseract
pip install PILLOW

tesseract-OCR编译安装

（1）leptonica编译安装
下载地址：http://www.leptonica.com/download.html
编译安装：

tar zxvf leptonica-1.75.3.tar.gz
cd leptonica-1.75.3
./configure 
make&&make install

（2）tesseract编译安装

wget https://github.com/tesseract-ocr/tesseract/archive/tesseract-3.05.01.tar.gz
tar zxvf tesseract-3.05.01.tar.gz
cd tesseract-3.05.01
./autogen.sh
make
make install
ldconfig

报错问题及解决方案：参考https://it.baiked.com/system/ops/2291.html
测试是否成功：
tesseract

验证码识别

抓取相应网页

# 引入相关package
from PIL import Image,ImageEnhance
import pytesseract
from selenium import webdriver
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support import ui
from selenium.webdriver.firefox.options import Options

# 爬虫抓取代码： 略

获取验证码图像

def get_picture(driver):
    global cache
    if cache == []:
        captchaElem = driver.find_element_by_xpath('//*[@id="captcha_img"]')
        captchaX = int(captchaElem.location['x'])
        captchaY = int(captchaElem.location['y'])
        captchaWidth = captchaElem.size['width']
        captchaHeight = captchaElem.size['height']
        captchaRight = captchaX + captchaWidth
        captchaBottom = captchaY + captchaHeight
        cache = [captchaX,captchaY,captchaRight,captchaBottom]
    else:
        captchaX,captchaY,captchaRight,captchaBottom = cache

    driver.get_screenshot_as_file("screenshot.png")
    imgObject = Image.open("screenshot.png")

    im = imgObject.crop((captchaX, captchaY, captchaRight, captchaBottom))
    
    return im

图像预处理

def im_process(im):
    # 将RGB彩图转为灰度图
    gray = im.convert('L')
    # 将灰度图按照设定阈值转化为二值图
    gray = gray.point(lambda x: 0 if x < 100 else 1, '1')
    
    return gray

图像去噪

转化为二值图片后，就需要清除噪点。本项目图片比较简单，大部分噪点也是最简单的那种孤立点，所以可以通过检测这些孤立点就能移除大量的噪点。
关于如何去除更复杂的噪点甚至干扰线和色块，有比较成熟的算法: 洪水填充法 Flood Fill ，本文为了问题简单化，选用较为简单的九宫格去噪法：

对某个黑点周边的九宫格里面的黑色点计数;
如果黑色点少于2个则证明此点为孤立点，然后得到所有的孤立点;
对所有孤立点一次批量移除。

具体实现算法如下：

## 降噪
def sum_9_region_new(img, x, y):
    '''确定噪点 '''
    cur_pixel = img.getpixel((x, y))  # 当前像素点的值
    width = img.width
    height = img.height
 
    if cur_pixel == 1:  # 如果当前点为白色区域,则不统计邻域值
        return 0
 
    # 因当前图片的四周都有黑点，所以周围的黑点可以去除
    if y < 2:  # 本例中，前两行的黑点都可以去除
        return 1
    elif y == height - 1:  # 最下面一行
        if x < 1 or x == width - 1:
            return 1
        else:
            sum = img.getpixel((x - 1, y - 1)) \
                      + img.getpixel((x - 1, y)) \
                      + img.getpixel((x, y - 1)) \
                      + cur_pixel \
                      + img.getpixel((x + 1, y - 1)) \
                      + img.getpixel((x + 1, y)) 
            return 6 - sum 
    else:  # y不在边界
        if x < 2:  # 前两列
            return 1
        elif x == width - 1:  # 右边非顶点
            sum = img.getpixel((x - 1, y - 1)) \
                  + img.getpixel((x - 1, y)) \
                  + img.getpixel((x - 1, y + 1)) \
                  + img.getpixel((x, y - 1)) \
                  + cur_pixel \
                  + img.getpixel((x, y + 1)) 
            return 6 - sum
        else:  # 具备9领域条件的
            sum = img.getpixel((x - 1, y - 1)) \
                  + img.getpixel((x - 1, y)) \
                  + img.getpixel((x - 1, y + 1)) \
                  + img.getpixel((x, y - 1)) \
                  + cur_pixel \
                  + img.getpixel((x, y + 1)) \
                  + img.getpixel((x + 1, y - 1)) \
                  + img.getpixel((x + 1, y)) \
                  + img.getpixel((x + 1, y + 1))
            return 9 - sum

 
def collect_noise_point(img):
    '''收集所有的噪点'''
    noise_point_list = []
    for x in range(img.width):
        for y in range(img.height):
            res_9 = sum_9_region_new(img, x, y)
            if (0 < res_9 < 3) and img.getpixel((x, y)) == 0:  # 找到孤立点
                pos = (x, y)
                noise_point_list.append(pos)
    return noise_point_list
 
def remove_noise_pixel(img, noise_point_list):
    '''根据噪点的位置信息，消除二值图片的黑点噪声'''
    for item in noise_point_list:
        img.putpixel((item[0], item[1]), 1)

验证码识别

# 解析验证图像字符
def parse_ycode(im):
    # 预处理
    im = im_process(im)
    # 去噪
    noise_point_list = collect_noise_point(im)
    remove_noise_pixel(im, noise_point_list)
    #  pytesseract识别验证码
    yanzhengma = pytesseract.image_to_string(im,lang='eng')
    yanzhengma = list(filter(str.isalnum, str(yanzhengma)))

    if len(yanzhengma) == 5:
        return yanzhengma       
    else:
        im.save(SAVE_FILE+'0-'+''.join(yanzhengma)+'.jpg')
        return None

Tesseract-OCR训练

安装依赖包

yum install cairo-devel pango-devel libicu-devel

编译安装训练工具

tesseract根目录下，执行以下命令：

./configure
make training
make training-install

安装jTessBoxEditor

jTessBoxEditor需要jre7（Java Runtime Environment）以上的版本支持。
安装完jre后，下载jTessBoxEditor，解压，运行train.bat文件即可运行。
具体操作可参考：https://www.jianshu.com/p/5c8c6b170f6f

合成图片

返回到win系统上，运行jTessBoxEditor工具，把所有图片合成一张.tif格式的图片（命名为[lang].[fontname].exp[num].tif）。

生成box文件

在tif文件所在的目录下打开一个命令行，产生相应的Box文件（*.box）来生成一个box文件，该文件记录了tesseract识别出来的每一个字和其位置坐标。
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] batch.nochop makebox

修正文字内容

把[lang].[fontname].exp[num].box下载下来，放到win系统下，放到之前[lang].[fontname].exp[num].tif目录下。
使用jTessBoxEditor开始修正文字。
具体修正操作可参考：https://www.jianshu.com/p/5c8c6b170f6f

开始训练

将修正后的box文件替换掉原始box文件。执行下面操作：
tesseract [lang].[fontname].exp[num].tif [lang].[fontname].exp[num] box.train.stderr
这一步会生成两个文本文件：[lang].[fontname].exp[num].tr 和 [lang].[fontname].exp[num].txt，后者只有一些换行符，前者对应于box文件中各字符在tif图片文件中的形状信息，记录的方式实际上是将一个字符看成是一个多边形，而tr文件记录的就是多边形每条边的位置、方向、长度等信息。

生成字符集信息

unicharset_extractor [lang].[fontname].exp[num].box
这一步会生成一个名为unicharset的文本文件，正如其名字表明的，这个文件记录的是一个字符集，它存有box文件里面不重复的字符信息，每个单独字符占一行。

创建字体信息文件

我们可以训练tesseract识别同一种语言的不同字体（这里只训练一种字体），我们需要提供字体相关的特性，这是通过一个叫做font_properties的文本文件标明的。这个文件的每一行以如下格式记录了一个字体的信息：
<fontname> <italic> <bold> <fixed> <serif> <fraktur>
本文的训练中使用了名为Boldface的字体，因此font_properties里面需要有一行以Boldface开头的字体信息。
除了手动创建这个文件外，tesseract-ocr源码中也提供了一个这样的font_properties文件（training/langdata/font_properties），并且里面已经有了很多字体的信息，因此这里就不许要手动创建了，后面的步骤要用的这个文件的时候，直接指定使用这个文件就行了。例如：
Boldface 0 1 0 0 0

聚合

首先使用shapeclustering
shapeclustering -F font_properties -U unicharset [lang].[fontname].exp[num].tr
这一步会输出一个名为shapetable的文件，下一步的mftraining会自动在当前目录加载这个文件。

接下来执行mftraining
mftraining -F font_properties -U unicharset -O [lang].unicharset [lang].[fontname].exp[num].tr
输出结果有警告，但不影响，执行完成后会生成三个文件：[lang].unicharset, inttemp, pffmtable

最后执行cntraining
cntraining [lang].[fontname].exp[num].tr
这一步生成一个名为normproto的文件

合并生成traineddata文件

现在你只需要合并所有的文件(shapetable, normproto, inttemp, pffmtable)，用相同的前缀重命名它们，如lang.。
combine_tessdata lang.
注意：不要忘记最后一个点！
将生成的[lang].traineddata放在tessdata目录下。然后你就可以用你训练的语言去测试了：

tesseract --list-langs
tesseract image.tif output -l lang