Python Crawler + KNN Digital Captcha Recognition System: A Machine Learning Application (with Complete Source Code and Training Dataset)



Foreword

This project uses Python crawler technology to collect captcha images from the web and, through a series of processing steps including denoising and segmentation, recognizes the captchas and verifies the recognition accuracy.

First, we use a Python crawler to automatically fetch captcha images from the target website. Captchas are usually designed to prevent bots from automatically accessing pages or submitting forms; with a crawler, we can collect these captcha images for subsequent processing.

Next, we denoise and segment the collected captcha images. Denoising removes noise from the image, making the captcha clearer and easier to read; segmentation extracts each character of the captcha separately for subsequent recognition.

Subsequently, we train a model with the KNN (K-Nearest Neighbors) algorithm. KNN is a commonly used supervised learning algorithm that classifies a new sample by a majority vote among the labels of its nearest neighbors in feature space. We train KNN on the segmented captcha characters so that the model can recognize the different digits.
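As an illustration, here is a minimal self-contained sketch of KNN classification on toy 2-D data (the feature vectors and labels below are invented for illustration; in this project the real features are flattened character images):

import numpy as np
from sklearn import neighbors

# Toy data for illustration only: six 2-D points in two clusters
x_train = np.array([[0, 0], [0, 1], [1, 0], [8, 8], [8, 9], [9, 8]])
y_train = np.array([0, 0, 0, 1, 1, 1])

# KNN memorizes the training samples; a new sample is classified by a
# majority vote among its k nearest neighbors in feature space
clf = neighbors.KNeighborsClassifier(n_neighbors=3)
clf.fit(x_train, y_train)
print(clf.predict(np.array([[1, 1], [9, 9]])))  # prints [0 1]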

Finally, we verify the accuracy of the trained model. By feeding captcha images with known labels into the model, we obtain its predictions; comparing the predictions with the actual labels gives the model's accuracy and lets us evaluate its performance.

The goal of this project is automatic recognition of captcha images and verification of recognition accuracy through crawler technology and image-processing algorithms. It can be applied to scenarios that process large numbers of captchas, such as automated testing and data collection. By iterating on training and verification, we can keep improving the accuracy and stability of captcha recognition and the efficiency of captcha processing.

Overall design

This part includes the overall structure diagram of the system and the system flow chart.

System overall structure diagram

The overall structure of the system is shown in the figure.

(Figure: overall system structure diagram)

System flow chart

The system flow is shown in the figure.

(Figure: system flow chart)

Operating environment

This part mainly covers the Python environment.

Python environment

Python 2.7 is required. In a Windows environment, download Anaconda to complete the Python configuration; the download address is https://www.anaconda.com/ . You can also download a virtual machine and run the code in a Linux environment.
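As a quick sanity check, here is a minimal sketch (assuming the third-party packages used later in this article: requests, OpenCV, NumPy, and scikit-learn) that verifies the interpreter and dependencies are in place:

import sys
print(sys.version)  # expect 2.7.x

# These imports should all succeed once the environment is configured
import requests
import cv2
import numpy
import sklearn
print('opencv %s, scikit-learn %s' % (cv2.__version__, sklearn.__version__))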

Module implementation

This project includes 4 modules: data crawling, denoising and segmentation, model training and storage, and accuracy verification. The function and relevant code of each module are introduced below.

1. Data crawling

This module uses a requests-based crawler to grab 1,200 captcha images and label them (each image is labeled manually by renaming the file so that its first four characters are the four digits shown in the captcha). The relevant code is as follows:

from __future__ import unicode_literals
import requests
import time

if __name__ == "__main__":
    # Set number to the total count of images to fetch (1,200 were collected for this project)
    number = 100
    for num in range(number):
        img_url = 'http://run.hbut.edu.cn/Account/LogOn?ReturnUrl=%2f'
        # Millisecond timestamp request parameter
        data = {'timestamp': unicode(long(time.time() * 1000))}
        # GET request to fetch the captcha image resource
        res = requests.get(img_url, params=data)
        # Save the image locally
        with open("./download_image/%d.jpg" % num, "wb") as f:
            f.write(res.content)
            print("%d.jpg fetched successfully" % num)

(Figure: crawled captcha images)

2. Denoising and segmentation

After the images are successfully crawled, denoising and segmentation are performed.

1) Remove background noise

After converting the image to grayscale, segment it: remove the border and part of the noise, and divide it into 4 sub-images. For each sub-image, compute a grayscale histogram (the number of bins is set by yourself) and find the bin with the second-largest pixel count. The bin with the most pixels corresponds to the white background, so the second-largest bin corresponds to the character content. Take the midpoint of that bin as the mode, then keep only the pixels within (mode ± biases) and set all other pixels to white, which removes most of the noise.

import math
import cv2

def del_noise(im_cut):
    bins = 16
    # ceil returns the smallest integer >= the given number
    num_gray = math.ceil(256 / bins)
    hist = cv2.calcHist([im_cut], [0], None, [bins], [0, 256])
    lists = []
    for i in range(len(hist)):
        lists.append(hist[i][0])
    # Find the bin with the second-largest count; the largest is the blank background
    second_max = sorted(lists)[-2]
    bins_second_max = lists.index(second_max)
    # Take the midpoint of that bin as mode, then keep pixels within (mode ± biases)
    mode = (bins_second_max + 0.5) * num_gray
    for i in range(len(im_cut)):
        for j in range(len(im_cut[0])):
            if im_cut[i][j] < mode - 15 or im_cut[i][j] > mode + 15:
                # Pixels not near the mode are set to white (255)
                im_cut[i][j] = 255
    return im_cut

2) Image segmentation

Segment the 1,200 labeled images to obtain 4,800 sub-images. The relevant code is as follows:

import os
import cv2

def cut_image(image, num, img_name):
    # Convert the BGR image to grayscale
    im = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Full captcha region is im[8:47, 28:128]; split it into 4 character regions
    im_cut_1 = im[8:47, 27:52]
    im_cut_2 = im[8:47, 52:77]
    im_cut_3 = im[8:47, 77:102]
    im_cut_4 = im[8:47, 102:127]
    im_cut = [im_cut_1, im_cut_2, im_cut_3, im_cut_4]
    for i in range(4):
        im_temp = del_noise(im_cut[i])
        # Save each of the 4 character images, named num_position_label.jpg
        cv2.imwrite('./img_train_cut/' + str(num) + '_' + str(i) + '_' + img_name[i] + '.jpg', im_temp)

if __name__ == '__main__':
    img_dir = './img'
    img_name = os.listdir(img_dir)  # List all entries in the directory
    for i in range(len(img_name)):
        path = os.path.join(img_dir, img_name[i])
        image = cv2.imread(path)
        # The first 4 characters of the filename are the captcha digits
        name_list = list(img_name[i])[:4]
        cut_image(image, i, name_list)
        print('Image %s segmented' % i)
    print('***** Image segmentation preprocessing finished! *****')

The images are segmented successfully, as shown in the following figure:

(Figure: segmented character images)

3. Model training and storage

After processing the data, split it into a training set and a test set, then train the model and save it. The relevant code is as follows:

import os
import cv2
import numpy as np
from sklearn import neighbors
from sklearn.preprocessing import LabelBinarizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.externals import joblib

if __name__ == '__main__':
    # Read in the data
    data = []
    labels = []
    img_dir = './img_train_cut'
    img_name = os.listdir(img_dir)
    for i in range(len(img_name)):
        path = os.path.join(img_dir, img_name[i])
        # cv2 reads a 3-channel BGR image; convert to grayscale and flatten to 1-D
        image = cv2.imread(path)
        im = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
        image = im.reshape(-1)
        data.append(image)
        # The label digit is the 5th character from the end of the filename
        y_temp = img_name[i][-5]
        labels.append(y_temp)
    # One-hot encode the labels
    y = LabelBinarizer().fit_transform(labels)
    x = np.array(data)
    y = np.array(y)
    # Split training and test data
    x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
    # Train the KNN classifier
    clf = neighbors.KNeighborsClassifier()
    clf.fit(x_train, y_train)
    # Save the classifier model
    joblib.dump(clf, './knn.pkl')
    # Print test results
    pre_y_train = clf.predict(x_train)
    pre_y_test = clf.predict(x_test)
    class_name = ['class0', 'class1', 'class2', 'class3', 'class4',
                  'class5', 'class6', 'class7', 'class8', 'class9']
    print(classification_report(y_train, pre_y_train, target_names=class_name))
    print(classification_report(y_test, pre_y_test, target_names=class_name))
    # To reload and evaluate the saved model:
    # clf = joblib.load('knn.pkl')
    # pre_y_test = clf.predict(x)
    # print(classification_report(y, pre_y_test, target_names=class_name))

After the model is saved, it can be reloaded for reuse or ported to other environments.
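For example, here is a minimal sketch of reloading the saved classifier in another script and predicting a single segmented character image (the filename ./img_train_cut/0_0_1.jpg is a hypothetical example following the naming scheme above):

import cv2
import numpy as np
from sklearn.externals import joblib

# Reload the trained classifier from disk
clf = joblib.load('./knn.pkl')

# Read one segmented character image as grayscale and flatten it to one row
im = cv2.imread('./img_train_cut/0_0_1.jpg', cv2.IMREAD_GRAYSCALE)
x = im.reshape(1, -1)

# predict() returns a one-hot row; argmax recovers the digit class
pre_y = clf.predict(x)
print('predicted digit: %d' % np.argmax(pre_y[0]))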

4. Accuracy verification

Use the original captcha images (4 digits each) to test the accuracy. The relevant code is as follows:

from __future__ import division
import cv2
import math
import numpy as np
import os
from sklearn.externals import joblib

def del_noise(im_cut):
    '''
    bins: number of bins in the grayscale histogram
    num_gray: pixel range covered by each bin
    Method: 1. Find the bin with the second-largest pixel count (second_max);
    since most of the image is blank, the largest count is the background and
    the second-largest is the desired content. 2. Compute mode. 3. Set every
    pixel outside the neighborhood of mode to blank.
    '''
    bins = 16
    num_gray = math.ceil(256 / bins)
    hist = cv2.calcHist([im_cut], [0], None, [bins], [0, 256])
    lists = []
    for i in range(len(hist)):
        # Collect the count of each histogram bin
        lists.append(hist[i][0])
    # The largest count is the blank background of the captcha; take the second largest
    second_max = sorted(lists)[-2]
    bins_second_max = lists.index(second_max)
    # Take the midpoint of that bin as mode, then keep pixels within (mode ± biases)
    mode = (bins_second_max + 0.5) * num_gray
    for i in range(len(im_cut)):
        for j in range(len(im_cut[0])):
            if im_cut[i][j] < mode - 15 or im_cut[i][j] > mode + 15:
                # Pixels not near the mode are set to white (255)
                im_cut[i][j] = 255
    return im_cut

def predict(image, img_name):
    # Convert the BGR image to grayscale
    im = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    # Full captcha region is im[8:47, 28:128]; split it into 4 character regions
    im_cut_1 = im[8:47, 27:52]
    im_cut_2 = im[8:47, 52:77]
    im_cut_3 = im[8:47, 77:102]
    im_cut_4 = im[8:47, 102:127]
    im_cut = [im_cut_1, im_cut_2, im_cut_3, im_cut_4]
    pre_text = []
    for i in range(4):
        # Denoise, flatten to 1-D, then wrap into the 2-D input variable x
        im_temp = del_noise(im_cut[i])
        image = im_temp.reshape(-1)
        tmp = []
        tmp.append(list(image))
        x = np.array(tmp)
        pre_y = clf.predict(x)
        pre_y = np.argmax(pre_y[0])
        pre_text.append(str(pre_y))
    pre_text = ''.join(pre_text)
    if pre_text != img_name:
        print('label:%s predict:%s \tfalse' % (img_name, pre_text))
        return 0
    else:
        print('label:%s predict:%s' % (img_name, pre_text))
        return 1

if __name__ == '__main__':
    img_dir = './img_test'
    img_name = os.listdir(img_dir)  # List all entries in the directory
    right = 0
    clf = joblib.load('knn.pkl')
    for i in range(len(img_name)):
        path = os.path.join(img_dir, img_name[i])
        image = cv2.imread(path)
        name_list = list(img_name[i])[:4]
        name = ''.join(name_list)
        pre = predict(image, name)
        right += pre
    accuracy = (right / len(img_name)) * 100
    print('Accuracy: %s%%, %s captchas in total, correct: %s, wrong: %s'
          % (accuracy, len(img_name), right, len(img_name) - right))

System test

The accuracy of the test results is over 99%, as shown in the figure below.

(Figure: test output showing accuracy above 99%)

Project source code download

See my blog's resource download page for details.


Other information download

If you want to continue learning about artificial-intelligence learning routes and knowledge systems, you are welcome to read my other blog post " Heavy | Complete artificial intelligence AI learning-basic knowledge learning route, all materials can be downloaded directly from the network disk without paying attention to routines ".
That post draws on well-known open-source platforms on GitHub, AI technology platforms, and experts in related fields: Datawhale, ApacheCN, AI Youdao, and Dr. Huang Haiguang, among others. It covers about 100 GB of related materials, which I hope will help all friends.
