CNN vs. CAPTCHA

Introduction

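In the world of web crawling, the arms race never ends. Once there were crawlers, there were anti-crawler measures; once there were anti-crawler measures, there were counter-measures against those.
One of the strongest weapons on the anti-crawler side is the CAPTCHA. The dazzling variety of CAPTCHAs makes many people give up partway through writing a crawler, going "from getting started to quitting", which is of course exactly what site builders want. But CAPTCHAs are not unbreakable: now that deep learning is everywhere, many simple CAPTCHAs turn out to be quite fragile.
This article shows how to use Python, OpenCV and a CNN to break one type of CAPTCHA, and hopefully gives you a feel for the appeal of deep learning.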

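Getting the data

I collected CAPTCHAs from the account registration page of a website, 346 images in total, shown below:

The CAPTCHA dataset

As you can see, these CAPTCHAs consist of uppercase letters and digits, contain a fair amount of noise, and some of the characters stick together.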

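Labeling the data

These raw CAPTCHAs cannot be used for modeling directly; they need to be preprocessed first.
The preprocessing method is described in the blog post "OpenCV入门之获取验证码的单个字符（二）" (Getting Started with OpenCV: Extracting Single Characters from a CAPTCHA, Part 2). After that, each character image is labeled by placing it into the folder of the matching character. Yes, you read that right: every image was labeled one by one, by hand, which took me a bit over 3 hours, o(╥﹏╥)o~ (Labeling data up front is unavoidable for modeling and is, of course, a painful process; think of WordNet, ImageNet and so on.) The folders after labeling look like this: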

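The folders after labeling

There are 31 folders, i.e. 31 target classes; the characters 0, M, W, I and O never appear in the CAPTCHAs. This yields 1371 valid character images, i.e. 1371 samples. Taking the letter U as an example, the images in its folder are:

Samples of the letter U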
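
The class list can be derived rather than typed by hand. The sketch below (the name `EXCLUDED` is my own, not from the original scripts) removes the five characters that never occur from the 36 digits and uppercase letters and confirms that 31 classes remain, in the same sorted order as the `chars` string used by the scripts later in this article.

```python
import string

# Characters that never occur in this site's CAPTCHAs, as stated above.
EXCLUDED = set('0MWIO')

# Digits first, then uppercase letters: this is also plain sorted (ASCII) order.
classes = [c for c in string.digits + string.ascii_uppercase if c not in EXCLUDED]

print(len(classes))        # 31
print(''.join(classes))    # 123456789ABCDEFGHJKLNPQRSTUVXYZ
```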

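Unifying the size

Labeling alone is still not enough for modeling, because the character images do not all have the same size. We therefore need to resize every sample to a common size; after some inspection I settled on 16*20 (16 pixels wide, 20 pixels high). The Python script is: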

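import os
import cv2
import uuid

def convert(dir, file):
    """Binarize one character image, resize it to 16x20 and save it under a new name."""
    imagepath = dir+'/'+file
    # Read the image in grayscale mode
    image = cv2.imread(imagepath, 0)
    # Binarize
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    # Resize; dsize is (width, height), so the result has 20 rows and 16 columns
    img = cv2.resize(thresh, (16, 20), interpolation=cv2.INTER_AREA)
    # Save under a unique name, then delete the original file
    cv2.imwrite('%s/%s.jpg' % (dir, uuid.uuid1()), img)
    os.remove(imagepath)

def main():
    # The 31 characters that occur in the CAPTCHAs (one folder per character)
    chars = '123456789ABCDEFGHJKLNPQRSTUVXYZ'
    dirs= ['E://verifycode_data/%s'%char for char in chars]
    for dir in dirs:
        for file in os.listdir(dir):
            convert(dir, file)

main()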
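
To make the binarization step concrete, here is a pure-Python sketch of what `cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)` computes: every pixel strictly greater than the threshold becomes 255, everything else becomes 0. The helper name `binarize` and the tiny sample image are illustrative assumptions, not part of the original script.

```python
def binarize(image, thresh=127, maxval=255):
    """Mimic cv2.THRESH_BINARY on a grayscale image given as a list of rows."""
    return [[maxval if pixel > thresh else 0 for pixel in row] for row in image]

# A made-up 2x4 grayscale patch (values 0..255).
patch = [[0, 127, 128, 255],
         [90, 200, 30, 140]]

print(binarize(patch))
# [[0, 0, 255, 255], [0, 255, 0, 255]]
```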

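The sample dataset

With the character images at a uniform size, the next step is to turn them into vectors. The images are black and white, so each one is read as a vector of 0/1 values, and its label (the y value) is the name of the folder it lives in. The Python script is: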

import os
import cv2
import pandas as pd

# One row per character image: 320 pixel values followed by the label
table= []

def Read_Data(dir, file):

    imagepath = dir+'/'+file
    # Read the image in grayscale mode
    image = cv2.imread(imagepath, 0)
    # Binarize
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    # Flatten row by row into 0/1 values (1 = white pixel, 0 = black pixel)
    bin_values = [1 if pixel==255 else 0 for pixel in thresh.ravel()]
    # The folder name is the label
    label = dir.split('/')[-1]
    table.append(bin_values+[label])

def main():
    chars = '123456789ABCDEFGHJKLNPQRSTUVXYZ'
    dirs= ['E://verifycode_data/%s'%char for char in chars]
    print(dirs)
    for dir in dirs:
        for file in os.listdir(dir):
            Read_Data(dir, file)

    features = ['v'+str(i) for i in range(1, 16*20+1)]
    label = ['label']
    df = pd.DataFrame(table, columns=features+label)
    # print(df.head())

    df.to_csv('E://verifycode_data/data.csv', index=False)
main()

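This converts the character images into the vectors and labels stored in data.csv. Part of data.csv looks like this:

Vectors and labels of the character images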
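
As a small illustration of the row format, the sketch below flattens a made-up 2x3 binary image the same way `Read_Data` does (row by row, 1 for white pixels, 0 otherwise) and appends the label taken from the folder name. The helper `to_row` and the sample path are hypothetical.

```python
def to_row(binary_image, folder):
    """Flatten a binarized image (list of rows, values 0 or 255) and append its label."""
    bin_values = [1 if pixel == 255 else 0 for row in binary_image for pixel in row]
    label = folder.split('/')[-1]   # the folder name is the character
    return bin_values + [label]

# Made-up 2x3 image; a real sample has 20 rows x 16 columns = 320 values.
image = [[255, 0, 255],
         [0, 0, 255]]
row = to_row(image, 'E://verifycode_data/U')
print(row)                      # [1, 0, 1, 0, 0, 1, 'U']

# Column names as in data.csv: v1 .. v320 plus 'label'.
columns = ['v' + str(i) for i in range(1, 16 * 20 + 1)] + ['label']
print(len(columns))             # 321
```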

With the sample dataset in place, we can build a model with a CNN. A typical CNN stacks several convolution layers and pooling layers and ends with fully connected layers that produce the output.

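The CNN in this article has two convolution layers, each followed by a pooling layer, then a fully connected layer, a dropout layer (to guard against overfitting), and finally a softmax layer that outputs the result. The loss function is the cross-entropy (log loss), and the parameters are tuned with the Adam optimizer, a gradient-descent variant. The Python code (VerifyCodeCNN.py) is: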

# -*- coding: utf-8 -*-
import tensorflow as tf
import logging

# Logging setup
logging.basicConfig(level = logging.INFO, format='%(asctime)s - %(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

class CNN:

    # Initialization
    # Arguments: epoch: number of training iterations
    #            learning_rate: learning rate passed to the optimizer
    #            save_model_path: absolute path where the model is saved
    def __init__(self, epoch, learning_rate, save_model_path):

        self.epoch = epoch
        self.learning_rate = learning_rate
        self.save_model_path = save_model_path

        """
        Layer 1: convolution + pooling
        x_image(batch, 16, 20, 1) -> h_pool1(batch, 8, 10, 10)
        """
        x = tf.placeholder(tf.float32, [None, 320])
        self.x = x
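        # The last dimension is the number of channels (it would be 3 for RGB).
        # NOTE: the images were saved as 20 rows x 16 columns, so [-1, 20, 16, 1]
        # would match the pixel layout; the results reported in this article were
        # obtained with the reshape below.
        x_image = tf.reshape(x, [-1, 16, 20, 1])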
        W_conv1 = self.weight_variable([3, 3, 1, 10])
        b_conv1 = self.bias_variable([10])

        h_conv1 = tf.nn.relu(self.conv2d(x_image, W_conv1) + b_conv1)
        h_pool1 = self.max_pool_2x2(h_conv1)

        """
        Layer 2: convolution + pooling
        h_pool1(batch, 8, 10, 10) -> h_pool2(batch, 4, 5, 20)
        """
        W_conv2 = self.weight_variable([3, 3, 10, 20])
        b_conv2 = self.bias_variable([20])

        h_conv2 = tf.nn.relu(self.conv2d(h_pool1, W_conv2) + b_conv2)
        h_pool2 = self.max_pool_2x2(h_conv2)

        """
        Layer 3: fully connected
        h_pool2(batch, 4, 5, 20) -> h_fc1(batch, 200)
        """
        W_fc1 = self.weight_variable([4 * 5 * 20, 200])
        b_fc1 = self.bias_variable([200])

        h_pool2_flat = tf.reshape(h_pool2, [-1, 4 * 5 * 20])
        h_fc1 = tf.nn.relu(tf.matmul(h_pool2_flat, W_fc1) + b_fc1)

        """
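        Layer 4: dropout
        h_fc1 -> h_fc1_drop; meant to be on during training and off at test time.
        Note that train() below feeds keep_prob = 1.0, which leaves dropout switched off.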
        """
        self.keep_prob = tf.placeholder(dtype=tf.float32)
        h_fc1_drop = tf.nn.dropout(h_fc1, self.keep_prob)

        """
        Layer 5: softmax output layer
        """
        W_fc2 = self.weight_variable([200, 31])
        b_fc2 = self.bias_variable([31])

        self.y_conv = tf.nn.softmax(tf.matmul(h_fc1_drop, W_fc2) + b_fc2)

        """
        Training and evaluation
        The Adam optimizer performs the gradient descent; keep_prob in feed_dict controls the dropout rate
        """
        self.y_true = tf.placeholder(shape = [None, 31], dtype=tf.float32)
        self.cross_entropy = -tf.reduce_mean(tf.reduce_sum(self.y_true * tf.log(self.y_conv), axis=1))  # cross-entropy

        # Minimize the cross-entropy with the Adam optimizer at the given learning rate
        self.train_model = tf.train.AdamOptimizer(self.learning_rate).minimize(self.cross_entropy)

        self.saver = tf.train.Saver()
        logger.info('Initialize the model...')

    def train(self, x_data, y_data):

        logger.info('Training the model...')

        with tf.Session() as sess:
            # Initialize all variables
            sess.run(tf.global_variables_initializer())

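            feed_dict = {self.x: x_data, self.y_true: y_data, self.keep_prob:1.0}
            # Report the loss about 50 times over the run (never less often than every step,
            # which also avoids a division by zero when epoch < 50)
            report_every = max(1, self.epoch // 50)
            # Iterative training on the full training set
            for i in range(self.epoch + 1):
                sess.run(self.train_model, feed_dict=feed_dict)
                if i % report_every == 0:
                    # Print the current loss; the message reads "trained %d times, loss: %s."
                    print('已训练%d次, loss: %s.' % (i, sess.run(self.cross_entropy, feed_dict=feed_dict)))

            # Save the CNN model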
            logger.info('Saving the model...')
            self.saver.save(sess, self.save_model_path)

    def predict(self, data):

        with tf.Session() as sess:
            logger.info('Restoring the model...')
            self.saver.restore(sess, self.save_model_path)
            predict = sess.run(self.y_conv, feed_dict={self.x: data, self.keep_prob:1.0})

        return predict

    """
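    Weight initialization
    Weights start as small random values near 0 (truncated normal); biases start as a small positive constant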
    """
    def weight_variable(self, shape):
        initial = tf.truncated_normal(shape, stddev=0.1)
        return tf.Variable(initial)

    def bias_variable(self, shape):
        initial = tf.constant(0.1, shape=shape)
        return tf.Variable(initial)

    """
    Convolution and pooling: the convolution uses stride 1 with zero padding ('SAME'), so height and width are preserved;
    pooling is plain 2x2 max pooling
    """
    def conv2d(self, x, W):
        return tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='SAME')

    def max_pool_2x2(self, x):
        return tf.nn.max_pool(x, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')
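
The layer shapes quoted in the docstrings above can be checked with a little arithmetic. The sketch below is framework-free and assumes what the code specifies: 3x3 convolutions with 'SAME' padding and stride 1 (height and width unchanged) and 2x2 max pooling with stride 2 and 'SAME' padding (height and width halved, rounded up). The function names are my own.

```python
import math

def conv_same(h, w, out_channels):
    """'SAME' convolution with stride 1 keeps height and width."""
    return h, w, out_channels

def pool_2x2(h, w, c):
    """2x2 max pooling with stride 2 and 'SAME' padding halves h and w (rounding up)."""
    return math.ceil(h / 2), math.ceil(w / 2), c

def conv_params(k, c_in, c_out):
    return k * k * c_in * c_out + c_out      # weights + biases

def fc_params(n_in, n_out):
    return n_in * n_out + n_out              # weights + biases

shape = (16, 20, 1)                           # x_image without the batch dimension
shape = pool_2x2(*conv_same(shape[0], shape[1], 10))
print(shape)                                  # (8, 10, 10)
shape = pool_2x2(*conv_same(shape[0], shape[1], 20))
print(shape)                                  # (4, 5, 20)

flat = shape[0] * shape[1] * shape[2]
print(flat)                                   # 400

total = (conv_params(3, 1, 10) + conv_params(3, 10, 20)
         + fc_params(flat, 200) + fc_params(200, 31))
print(total)                                  # 88351
```

So the model has roughly 88 thousand trainable parameters, most of them in the first fully connected layer.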

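Training the model

The 1371 samples are split 70/30: about 960 samples for training and about 411 for testing. The model is trained for 1000 iterations with the Adam optimizer at a learning rate of 0.0005.
The training script is: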

# -*- coding: utf-8 -*-

"""
Digit and letter recognition:
multi-class classification of the CAPTCHA dataset with a CNN
"""

from VerifyCodeCNN import CNN
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelBinarizer

CSV_FILE_PATH = 'E://verifycode_data/data.csv'          # path of the CSV file
df = pd.read_csv(CSV_FILE_PATH)       # read the CSV file

# Feature columns of the dataset
features = ['v'+str(i+1) for i in range(16*20)]
# One-hot encode (binarize) the true labels
lb = LabelBinarizer()
lb.fit(df['label'])
labels = lb.classes_                  # sorted class names; one-hot column i corresponds to labels[i]
y_true = pd.DataFrame(lb.transform(df['label']), columns=['y'+str(i) for i in range(31)])
y_bin_columns = list(y_true.columns)

for col in y_bin_columns:
    df[col] = y_true[col]

# Split into a training set (70%) and a test set (30%)
x_train, x_test, y_train, y_test = train_test_split(df[features], df[y_bin_columns],
                                                    train_size = 0.7, test_size=0.3, random_state=123)

# Build the CNN
# Path where the model is saved
MODEL_SAVE_PATH = 'E://logs/cnn_verifycode.ckpt'
# Initialize the CNN
cnn = CNN(1000, 0.0005, MODEL_SAVE_PATH)

# Train the CNN
cnn.train(x_train, y_train)
# Predict on the test set
y_pred = cnn.predict(x_test)

# Map each softmax output back to its class: pick the index with the highest probability
prediction = []
for pred in y_pred:
    label = labels[list(pred).index(max(pred))]
    prediction.append(label)

# Compute the prediction accuracy
x_test = x_test.copy()                # work on a copy to avoid pandas' SettingWithCopyWarning
x_test['prediction'] = prediction
x_test['label'] = df.loc[y_test.index, 'label']
print(x_test.head())
accuracy = accuracy_score(x_test['label'], x_test['prediction'])
# The message reads "The prediction accuracy of the CNN is %.2f%%."
print('CNN的预测准确率为%.2f%%.'%(accuracy*100))
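
The label handling in the script boils down to one-hot encoding over the sorted class list and an argmax when decoding. A dependency-free sketch of that idea follows; `one_hot` and `decode` are illustrative helpers, not scikit-learn functions, and the probability vector is made up.

```python
CLASSES = sorted('123456789ABCDEFGHJKLNPQRSTUVXYZ')   # 31 classes in sorted order

def one_hot(label):
    """Return a 31-element 0/1 list with a single 1 at the label's index."""
    return [1 if c == label else 0 for c in CLASSES]

def decode(probabilities):
    """Pick the class whose score is highest (argmax)."""
    best = max(range(len(probabilities)), key=lambda i: probabilities[i])
    return CLASSES[best]

encoded = one_hot('U')
print(sum(encoded), encoded.index(1))    # 1 26

# A made-up softmax output: mostly uniform, with extra mass on 'G'.
probs = [0.02] * 31
probs[CLASSES.index('G')] = 0.4
print(decode(probs))                     # G
```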

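Training this CNN took about 75 minutes. The output is as follows (the Chinese log lines read "trained N times, loss: ..." and "The prediction accuracy of the CNN is ..."):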

2018-09-24 11:51:17,784 - INFO: Initialize the model...
2018-09-24 11:51:17,784 - INFO: Training the model...
2018-09-24 11:51:17.793631: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
已训练0次, loss: 3.5277689.
已训练20次, loss: 3.2297606.
已训练40次, loss: 2.8372495.
已训练60次, loss: 1.9687067.
已训练80次, loss: 0.90995216.
已训练100次, loss: 0.42356998.
已训练120次, loss: 0.25189978.
已训练140次, loss: 0.16736577.
已训练160次, loss: 0.116674595.
已训练180次, loss: 0.08325087.
已训练200次, loss: 0.06060778.
已训练220次, loss: 0.045051433.
已训练240次, loss: 0.03401592.
已训练260次, loss: 0.026168587.
已训练280次, loss: 0.02056558.
已训练300次, loss: 0.01649161.
已训练320次, loss: 0.013489108.
已训练340次, loss: 0.011219621.
已训练360次, loss: 0.00946489.
已训练380次, loss: 0.008093053.
已训练400次, loss: 0.0069935927.
已训练420次, loss: 0.006101626.
已训练440次, loss: 0.0053245267.
已训练460次, loss: 0.004677901.
已训练480次, loss: 0.0041349586.
已训练500次, loss: 0.0036762774.
已训练520次, loss: 0.003284876.
已训练540次, loss: 0.0029500276.
已训练560次, loss: 0.0026618005.
已训练580次, loss: 0.0024126293.
已训练600次, loss: 0.0021957452.
已训练620次, loss: 0.0020071461.
已训练640次, loss: 0.0018413183.
已训练660次, loss: 0.001695599.
已训练680次, loss: 0.0015665392.
已训练700次, loss: 0.0014519279.
已训练720次, loss: 0.0013496162.
已训练740次, loss: 0.001257321.
已训练760次, loss: 0.0011744777.
已训练780次, loss: 0.001099603.
已训练800次, loss: 0.0010316349.
已训练820次, loss: 0.0009697884.
已训练840次, loss: 0.00091331534.
已训练860次, loss: 0.0008617487.
已训练880次, loss: 0.0008141668.
已训练900次, loss: 0.0007705136.
已训练920次, loss: 0.0007302323.
已训练940次, loss: 0.00069312396.
已训练960次, loss: 0.0006586343.
已训练980次, loss: 0.00062668725.
2018-09-24 13:07:42,272 - INFO: Saving the model...
已训练1000次, loss: 0.0005970755.
2018-09-24 13:07:42,538 - INFO: Restoring the model...
INFO:tensorflow:Restoring parameters from E://logs/cnn_verifycode.ckpt
2018-09-24 13:07:42,538 - INFO: Restoring parameters from E://logs/cnn_verifycode.ckpt
      v1  v2  v3  v4  v5  v6  v7  v8  v9  v10  ...    v313  v314  v315  v316  \
657    1   1   1   1   1   1   1   1   1    1  ...       1     1     1     1   
18     1   1   1   1   1   1   1   1   1    1  ...       1     1     1     1   
700    1   1   1   1   1   1   1   1   1    1  ...       1     1     1     1   
221    1   1   1   1   1   1   1   1   1    1  ...       1     1     1     1   
1219   1   1   1   1   1   1   1   1   1    1  ...       1     1     1     1   

      v317  v318  v319  v320  prediction  label  
657      1     1     1     1           G      G  
18       1     1     1     1           T      1  
700      1     1     1     1           H      H  
221      1     1     1     1           5      5  
1219     1     1     1     1           V      V  

[5 rows x 322 columns]
CNN的预测准确率为93.45%.

As shown, the CNN reaches a prediction accuracy of 93.45% on the test set, which is decent. The trained model is saved at E://logs/cnn_verifycode.ckpt.
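
As a sanity check on the split sizes: to the best of my understanding, `train_test_split` rounds a fractional `test_size` up and a fractional `train_size` down, which the sketch below reproduces with plain arithmetic. Under that assumption the split of 1371 samples is 959/412 rather than exactly 960/411, and the reported 93.45% would correspond to 385 correct test predictions; treat both numbers as my inference, not as figures from the original run.

```python
import math

n_samples = 1371
n_test = math.ceil(0.3 * n_samples)     # assumed rounding rule for test_size
n_train = math.floor(0.7 * n_samples)   # assumed rounding rule for train_size
print(n_train, n_test)                  # 959 412

# Which number of correct predictions reproduces the reported accuracy?
matches = [k for k in range(n_test + 1) if '%.2f' % (100 * k / n_test) == '93.45']
print(matches)                          # [385]
```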

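Predicting new CAPTCHAs

With the model trained, it is time to witness the magic!
I fetched 60 fresh CAPTCHAs from the same account registration site; they look like this:

The new CAPTCHAs

I wrote a Python script that predicts a CAPTCHA: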

# -*- coding: utf-8 -*-

"""
Recognize CAPTCHAs with the trained CNN model
(trained on about 960 samples for 1000 iterations; loss: 0.00059; accuracy on the test set: 93.45%.)
"""
import os
import cv2
import pandas as pd
from VerifyCodeCNN import CNN

def split_picture(imagepath):

    # Read the image in grayscale mode
    gray = cv2.imread(imagepath, 0)

    # Paint the border of the image white
    height, width = gray.shape
    for i in range(width):
        gray[0, i] = 255
        gray[height-1, i] = 255
    for j in range(height):
        gray[j, 0] = 255
        gray[j, width-1] = 255

    # Median filter
    blur = cv2.medianBlur(gray, 3)  # 3x3 kernel

    # Binarize
    ret,thresh1 = cv2.threshold(blur, 200, 255, cv2.THRESH_BINARY)

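    # Extract the individual characters
    chars_list = []
    # [-2] picks the contour list under both the OpenCV 3 and OpenCV 4 return conventions
    contours = cv2.findContours(thresh1, cv2.RETR_CCOMP, cv2.CHAIN_APPROX_SIMPLE)[-2]
    for cnt in contours:
        # Upright bounding rectangle of the contour
        x, y, w, h = cv2.boundingRect(cnt)
        # Skip boxes that touch the top or left border and boxes smaller than 100 pixels
        if x != 0 and y != 0 and w*h >= 100:
            chars_list.append((x,y,w,h))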

    sorted_chars_list = sorted(chars_list, key=lambda x:x[0])
    for i,item in enumerate(sorted_chars_list):
        x, y, w, h = item
        cv2.imwrite('E://test_verifycode/chars/%d.jpg'%(i+1), thresh1[y:y+h, x:x+w])

def remove_edge_picture(imagepath):

    image = cv2.imread(imagepath, 0)
    height, width = image.shape
    corner_list = [image[0,0] < 127,
                   image[height-1, 0] < 127,
                   image[0, width-1]<127,
                   image[ height-1, width-1] < 127
                   ]
    if sum(corner_list) >= 3:
        os.remove(imagepath)

def resplit_with_parts(imagepath, parts):
    image = cv2.imread(imagepath, 0)
    os.remove(imagepath)
    height, width = image.shape

    file_name = imagepath.split('/')[-1].split(r'.')[0]
    # Cut the image into `parts` pieces of equal width
    step = width//parts     # width of each piece
    start = 0             # starting column
    for i in range(parts):
        cv2.imwrite('E://test_verifycode/chars/%s.jpg'%(file_name+'-'+str(i)), \
                    image[:, start:start+step])
        start += step

def resplit(imagepath):

    image = cv2.imread(imagepath, 0)
    height, width = image.shape

    if width >= 64:
        resplit_with_parts(imagepath, 4)
    elif width >= 48:
        resplit_with_parts(imagepath, 3)
    elif width >= 26:
        resplit_with_parts(imagepath, 2)

# rename and convert to 16*20 size
def convert(dir, file):

    imagepath = dir+'/'+file
    # Read the image in grayscale mode
    image = cv2.imread(imagepath, 0)
    # Binarize
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    img = cv2.resize(thresh, (16, 20), interpolation=cv2.INTER_AREA)
    # Save the image, overwriting the original file
    cv2.imwrite('%s/%s' % (dir, file), img)

# Read the image data and convert it to 0/1 values
def Read_Data(dir, file):

    imagepath = dir+'/'+file
    # Read the image in grayscale mode
    image = cv2.imread(imagepath, 0)
    # Binarize
    ret, thresh = cv2.threshold(image, 127, 255, cv2.THRESH_BINARY)
    # Flatten to 0/1 values (1 = white pixel, 0 = black pixel)
    bin_values = [1 if pixel==255 else 0 for pixel in thresh.ravel()]

    return bin_values

def main():

    VerifyCodePath = 'E://test_verifycode/E224.jpg'
    dir = 'E://test_verifycode/chars'
    files = os.listdir(dir)

    # Remove any files left over from a previous run
    if files:
        for file in files:
            os.remove(dir + '/' + file)

    split_picture(VerifyCodePath)

    files = os.listdir(dir)
    if not files:
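        print('The folder is empty!')
    else:

        # Remove noise images
        for file in files:
            remove_edge_picture(dir + '/' + file)

        # Re-split images in which characters are stuck together
        for file in os.listdir(dir):
            resplit(dir + '/' + file)

        # Resize every image to 16*20
        for file in os.listdir(dir):
            convert(dir, file)

        # One 0/1 vector per character; sort the file names so that
        # single-digit indices stay in left-to-right order
        table = [Read_Data(dir, file) for file in sorted(os.listdir(dir))]
        test_data = pd.DataFrame(table, columns=['v%d'%i for i in range(1,321)])

        # Path where the model is saved
        MODEL_SAVE_PATH = 'E://logs/cnn_verifycode.ckpt'
        # Initialize the CNN (this rebuilds the graph; the weights are restored in predict())
        cnn = CNN(1000, 0.0005, MODEL_SAVE_PATH)
        y_pred = cnn.predict(test_data)

        # Predicted classes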
        prediction = []
        labels = '123456789ABCDEFGHJKLNPQRSTUVXYZ'
        for pred in y_pred:
            label = labels[list(pred).index(max(pred))]
            prediction.append(label)

        print(prediction)


main()
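
The segmentation logic in this script can be tried out without OpenCV. The sketch below mirrors two of its steps under the same thresholds: keeping only bounding boxes that are large enough and do not touch the top or left border, sorted left to right, and deciding into how many equal-width parts a wide blob is cut. The function names and sample boxes are illustrative.

```python
def select_boxes(boxes, min_area=100):
    """Keep (x, y, w, h) boxes that are big enough and away from the top/left border, left to right."""
    kept = [(x, y, w, h) for (x, y, w, h) in boxes
            if x != 0 and y != 0 and w * h >= min_area]
    return sorted(kept, key=lambda box: box[0])

def parts_for_width(width):
    """Number of characters assumed to be stuck together in a blob of this width."""
    if width >= 64:
        return 4
    if width >= 48:
        return 3
    if width >= 26:
        return 2
    return 1          # narrow enough to be a single character: no cut

def column_ranges(width, parts):
    """Equal-width column slices [start, stop) used to cut the blob."""
    step = width // parts
    return [(i * step, (i + 1) * step) for i in range(parts)]

boxes = [(40, 5, 30, 22), (0, 0, 100, 40), (10, 6, 14, 20), (75, 8, 3, 3)]
print(select_boxes(boxes))
# [(10, 6, 14, 20), (40, 5, 30, 22)]

print(parts_for_width(30), column_ranges(30, parts_for_width(30)))
# 2 [(0, 15), (15, 30)]
print(parts_for_width(50), column_ranges(50, parts_for_width(50)))
# 3 [(0, 16), (16, 32), (32, 48)]
```

Because `step` uses integer division, the last few columns of a blob can be dropped, as in the 50-pixel example.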

Taking the image E224.jpg as an example, the output is:

2018-09-25 20:50:33,227 - INFO: Initialize the model...
2018-09-25 20:50:33.238309: I T:\src\github\tensorflow\tensorflow\core\platform\cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2
2018-09-25 20:50:33,227 - INFO: Restoring the model...
INFO:tensorflow:Restoring parameters from E://logs/cnn_verifycode.ckpt
2018-09-25 20:50:33,305 - INFO: Restoring parameters from E://logs/cnn_verifycode.ckpt
['E', '2', '2', '4']

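The prediction is entirely correct. Next, all 60 images were tested: 54 were predicted completely correctly, while the other 6 had some characters wrong, so the whole-CAPTCHA accuracy reaches 90%.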
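
Whole-CAPTCHA accuracy counts a prediction as correct only if every character matches. Here is a small sketch of how that differs from per-character accuracy; the sample predictions are invented for illustration and are not the 60 test images.

```python
def captcha_accuracy(predicted, actual):
    """Fraction of CAPTCHAs whose full string is predicted correctly."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)

def char_accuracy(predicted, actual):
    """Fraction of individual characters predicted correctly (equal-length strings assumed)."""
    pairs = [(pc, ac) for p, a in zip(predicted, actual) for pc, ac in zip(p, a)]
    return sum(pc == ac for pc, ac in pairs) / len(pairs)

actual    = ['E224', 'K7PQ', 'X3YZ', 'A9BC']
predicted = ['E224', 'K7PQ', 'X3Y2', 'A9BC']   # one character wrong in the third

print(captcha_accuracy(predicted, actual))   # 0.75
print(char_accuracy(predicted, actual))      # 0.9375

# The figure reported above: 54 fully correct out of 60.
print(54 / 60)                               # 0.9
```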

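Summary

The CNN model really shines at CAPTCHA recognition, and it gives a good sense of how powerful deep learning is~
Of course, text-recognition CAPTCHAs are still among the easier ones, and this is just one application of CNNs. Harder CAPTCHAs require a more involved pipeline. I hope that after reading this article you will try your hand at recognizing more difficult CAPTCHAs yourself~~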
