Chapter 8: CAPTCHA Recognition --- 1. Image CAPTCHA Recognition + 2. Geetest CAPTCHA Recognition

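Image CAPTCHAs usually consist of four letters or digits and can be recognized with OCR. The Python library needed is tesserocr, which depends on tesseract; the latter can be downloaded from: https://github.com/tesseract-ocr/tesseract

When installing tesseract, make sure to check the "Additional language data download" option so that the language packs used for OCR are installed.

Next, install tesserocr with pip: pip install tesserocr pillow

Verify the installation: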

import tesserocr 
from PIL import Image 
image = Image.open('test.png') 
print(tesserocr.image_to_text(image))

Recognition test:

import tesserocr
from PIL import Image
image = Image.open('code.jpg')  # create an Image object
result = tesserocr.image_to_text(image)  # call image_to_text() with the Image object
print(result)

# A shorter alternative, though its results are not as good as the above
import tesserocr
print(tesserocr.file_to_text('code.jpg'))
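
The string returned by image_to_text() may carry whitespace, a trailing newline, or stray punctuation. A minimal sketch of a cleanup step, assuming the CAPTCHA contains only ASCII letters and digits; clean_ocr_text is a hypothetical helper and not part of tesserocr:

```python
import re


def clean_ocr_text(raw):
    """Keep only ASCII letters and digits from raw OCR output (hypothetical helper)."""
    return re.sub(r'[^0-9A-Za-z]', '', raw)


print(clean_ocr_text(' 5 5 9 4\n'))  # prints 5594
```

If the target site uses other characters, the character class in the pattern would need to be widened accordingly.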

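If the CAPTCHA image contains extra lines, they interfere with recognition, so the image should first be converted to grayscale and then binarized.

Take the CAPTCHA above as an example: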

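import tesserocr
from PIL import Image

image = Image.open('code.jpg')  # open the CAPTCHA image
image = image.convert('L')  # mode 'L' converts the image to grayscale
threshold = 80  # binarization threshold set to 80; without an explicit value the default is 127
table = []
for i in range(256):
    if i < threshold:
        table.append(0)
    else:
        table.append(1)
# The table works like an array: gray levels at or above the threshold map to 1, the rest to 0
image = image.point(table, '1')
image.show()
result = tesserocr.image_to_text(image)
print(result)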
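
The lookup table passed to point() can be checked without any image. A minimal sketch in plain Python, where build_table and binarize are hypothetical helper names and the gray levels are made-up sample values:

```python
def build_table(threshold):
    """Map each gray level 0-255 to 0 (below threshold) or 1 (at or above it)."""
    return [0 if level < threshold else 1 for level in range(256)]


def binarize(gray_levels, threshold=80):
    """Apply the table to a sequence of gray levels, one lookup per pixel."""
    table = build_table(threshold)
    return [table[level] for level in gray_levels]


print(binarize([0, 79, 80, 200]))  # prints [0, 0, 1, 1]
```

Trying a few thresholds on sample values like this makes it easier to see which gray levels end up as background and which as text before running OCR.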

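2. Geetest CAPTCHA Recognition

This is the type of CAPTCHA where you drag a puzzle piece into place to pass verification. It requires Selenium, the Chrome browser, and a configured ChromeDriver.

The official Geetest website is http://www.geetest.com/. Its features are:

A three-way defense against simulation, forgery, and brute-force attacks. According to the official site, verification takes only 0.4 seconds after a click and works on all platforms.

The recognition approach:

We complete the verification by directly simulating browser actions, that is, by using Selenium to imitate human behavior.

Take https://account.geetest.com/login as an example. The button there is a smart verification button: within the same session, a second click within a short period passes verification directly. If the second click does not pass, a slider verification window pops up, and the slider must be dragged to complete the image as a second verification.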

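Recognition takes the following three steps:

Simulate clicking the verification button
Locate the gap for the slider
Simulate dragging the slider

Step 1: simulate the click with Selenium.

Step 2: locate the gap. This requires some image processing to find where the gap is. Before the slider is moved, the displayed image has no gap, so it can be compared with the image that does have the gap.

We can capture both images, set a comparison threshold, and then traverse the pixels of the two images to find those at the same position whose RGB difference exceeds the threshold. Those pixels mark the gap position.

Step 3: move the slider. Geetest analyzes the drag trajectory to detect machines, so uniform-speed or random movement will be flagged. A person usually accelerates first and then decelerates when moving the slider, and we need to simulate this.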

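Initialization

Initialize the Selenium objects and configure a few parameters.

import time

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

Account='******'
Password='******'

class CrackGeetest():
    def __init__(self):
        self.url=''  # URL of the page to open
        self.browser=webdriver.Chrome()
        self.wait=WebDriverWait(self.browser,20)  # explicit wait, 20-second timeout
        self.email=Account
        self.password=Password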

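Simulating the click

First, use an explicit wait to get the verification button to click:

    def get_geetest_button(self):
        '''
        Get the initial verification button
        :return: the button element
        '''
        button=self.wait.until(EC.element_to_be_clickable((By.CLASS_NAME,'geetest_radar_tip')))
        return button

# With the WebElement in hand, call its click() method to simulate the click
# Click the verification button
button =self.get_geetest_button()
button.click()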

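Locating the gap

We need two images to compare; where they differ is the gap. Use Selenium to select the image element and obtain its position and size, then take a screenshot of the whole page and crop the CAPTCHA out of it.

    def get_position(self):
        """
        Get the position of the CAPTCHA
        :return: tuple with the CAPTCHA position
        """
        # wait until the image has loaded
        img = self.wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'geetest_canvas_img')))
        time.sleep(2)
        location = img.location  # position of the element
        size = img.size  # size of the element
        # coordinates of the top-left and bottom-right corners
        top = location['y']
        bottom = location['y'] + size['height']
        left = location['x']
        right = location['x'] + size['width']
        return (top, bottom, left, right)

    def get_geetest_image(self, name='captcha.png'):
        """
        Get the CAPTCHA image
        :return: image object
        """
        top, bottom, left, right = self.get_position()
        print('CAPTCHA position', top, bottom, left, right)
        # take a screenshot of the whole page
        screenshot = self.get_screenshot()
        # crop() cuts the CAPTCHA out of the screenshot
        captcha = screenshot.crop((left, top, right, bottom))
        captcha.save(name)
        return captcha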
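
The crop box is just the element's rectangle rearranged into the (left, top, right, bottom) order that crop() expects. A small sketch in plain Python; element_box is a hypothetical helper, and the scale argument is an assumption for displays where the screenshot has more pixels than the page's coordinates (leave it at 1 otherwise):

```python
def element_box(location, size, scale=1):
    """Turn Selenium-style location/size dicts into a (left, top, right, bottom) box."""
    left = location['x'] * scale
    top = location['y'] * scale
    right = (location['x'] + size['width']) * scale
    bottom = (location['y'] + size['height']) * scale
    return (left, top, right, bottom)


# Sample values for illustration only
print(element_box({'x': 10, 'y': 20}, {'width': 260, 'height': 160}))  # prints (10, 20, 270, 180)
```

If the cropped image comes out offset or too small, comparing the screenshot's pixel size with the browser window size is a quick way to tell whether a scale factor other than 1 is needed.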

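Next, get the second image, the one with the gap. Clicking the slider below the image triggers it: once the slider is clicked, the image with the gap appears.

    def get_slider(self):
        """
        Get the slider
        :return: slider element
        """
        slider = self.wait.until(EC.element_to_be_clickable((By.CLASS_NAME, 'geetest_slider_button')))
        return slider

get_slider() returns the slider element; calling its click() method triggers the click, and the image with the gap appears.

# Click the slider to reveal the gap
slider=self.get_slider()
slider.click()
# Then call get_geetest_image() again to capture the second image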

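We now have two images, image1 and image2. To find the gap, traverse the pixels and read the RGB values of both images at each position. If the two RGB values differ by less than a given amount, the pixels are considered the same and we move on to the next pixel. If the difference exceeds it, the pixels differ, and that position is the gap. The code is as follows:

    def is_pixel_equal(self, image1, image2, x, y):
        """
        Check whether two pixels are the same
        :param image1: image 1
        :param image2: image 2
        :param x: x position
        :param y: y position
        :return: whether the pixels are the same
        """
        # read the pixel at (x, y) from each image
        pixel1 = image1.load()[x, y]
        pixel2 = image2.load()[x, y]
        threshold = 60
        if abs(pixel1[0] - pixel2[0]) < threshold and abs(pixel1[1] - pixel2[1]) < threshold and abs(
                pixel1[2] - pixel2[2]) < threshold:
            return True
        else:
            return False

    def get_gap(self, image1, image2):
        """
        Get the offset of the gap
        :param image1: image without the gap
        :param image2: image with the gap
        :return: x offset of the gap
        """
        left = 60
        for i in range(left, image1.size[0]):
            for j in range(image1.size[1]):
                if not self.is_pixel_equal(image1, image2, i, j):
                    left = i
                    return left
        return left

get_gap() takes the two images as arguments and traverses their pixels, using is_pixel_equal() to check whether the absolute difference of each RGB channel between the two images is below the threshold. A pixel whose difference reaches the threshold marks the gap. The gap is generally to the right of the slider, so the scan starts to the right of the slider's initial position (x = 60 in the code); that way the first mismatch found is the gap itself.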
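
The comparison can be tried without a browser or image files. A minimal sketch in plain Python, assuming each image is represented as a list of rows of (r, g, b) tuples; pixels_equal and find_gap are hypothetical stand-ins for the two methods above, and the images are synthetic:

```python
def pixels_equal(p1, p2, threshold=60):
    """True when every RGB channel differs by less than threshold."""
    return all(abs(a - b) < threshold for a, b in zip(p1, p2))


def find_gap(full, gapped, start=60):
    """Return the first column at or after start where the two images differ.

    full and gapped are lists of rows, each row a list of (r, g, b) tuples.
    Returns start itself if no differing pixel is found.
    """
    width, height = len(full[0]), len(full)
    for x in range(start, width):
        for y in range(height):
            if not pixels_equal(full[y][x], gapped[y][x]):
                return x
    return start


# Synthetic 100x10 images: white background, a dark "gap" starting at column 75
full = [[(255, 255, 255)] * 100 for _ in range(10)]
gapped = [row[:] for row in full]
for y in range(3, 7):
    for x in range(75, 85):
        gapped[y][x] = (40, 40, 40)

print(find_gap(full, gapped))  # prints 75
```

Scanning column by column, rather than row by row, is what makes the first mismatch correspond to the left edge of the gap.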

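Simulating the drag

If the slider is dragged at uniform speed, the machine-learning model will recognize it as a program. So we simulate the motion in segments: uniform acceleration first, then uniform deceleration.

From physics, with acceleration a, current velocity v, initial velocity v0, displacement x, and elapsed time t, the following relations hold:

x = v0*t + 0.5*a*t*t, v = v0 + a*t

    def get_track(self, distance):
        """
        Build the movement trajectory from the offset
        :param distance: offset
        :return: movement trajectory
        """
        # movement trajectory
        track = []
        # current displacement
        current = 0
        # deceleration point: accelerate over the first 4/5, decelerate over the last 1/5
        mid = distance * 4 / 5
        # time interval for each step
        t = 0.2
        # initial velocity
        v = 0

        while current < distance:
            if current < mid:
                # acceleration of +2
                a = 2
            else:
                # acceleration of -3
                a = -3
            # initial velocity v0 for this step
            v0 = v
            # current velocity v = v0 + a*t
            v = v0 + a * t
            # distance moved x = v0*t + 1/2 * a * t^2
            move = v0 * t + 1 / 2 * a * t * t
            # update the current displacement
            current += move
            # record the distance moved during this 0.2 s interval
            track.append(round(move))
        return track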

get_track() takes the total distance to move and returns the trajectory, track, a list in which each element is the distance moved during one time interval.
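
Because each step is rounded and the last step can overshoot, the steps in track do not necessarily add up to exactly distance. A minimal standalone sketch of the same trajectory logic, plus one possible correction; build_track and correct_track are hypothetical names, and appending a final corrective step is an assumption of this sketch, not something the original code does:

```python
def build_track(distance, t=0.2, accel=2, decel=-3):
    """Standalone version of the accelerate-then-decelerate trajectory."""
    track = []
    current = 0.0
    mid = distance * 4 / 5  # switch from acceleration to deceleration here
    v = 0.0
    while current < distance:
        a = accel if current < mid else decel
        v0 = v
        v = v0 + a * t                   # v = v0 + a*t
        move = v0 * t + 0.5 * a * t * t  # x = v0*t + 1/2*a*t^2
        current += move
        track.append(round(move))
    return track


def correct_track(track, distance):
    """Append one final step (possibly negative) so the steps sum to distance."""
    error = distance - sum(track)
    return track + [error] if error else list(track)


raw = build_track(100)
fixed = correct_track(raw, 100)
print(sum(raw), sum(fixed))
```

A negative final step moves the slider slightly back after an overshoot, which is one way to land on the gap; whether that helps in practice would need to be checked against the real page.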

    def move_to_gap(self, slider, track):
        """
        Drag the slider to the gap
        :param slider: slider element
        :param track: trajectory
        :return:
        """
        ActionChains(self.browser).click_and_hold(slider).perform()
        for x in track:
            ActionChains(self.browser).move_by_offset(xoffset=x, yoffset=0).perform()
        time.sleep(0.5)
        ActionChains(self.browser).release().perform()

move_to_gap() takes the slider element and the trajectory. It first calls ActionChains' click_and_hold() to press and hold the slider at the bottom, then iterates over the trajectory and calls move_by_offset() to move by each step's distance, and finally calls release() to let go of the mouse.

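Finally, fill in the form and simulate a click on the login button; after a successful login the page redirects to the dashboard. This completes the Geetest CAPTCHA recognition.

Source code: https://github.com/Python3WebSpider/CrackGeetest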

锅巴QAQ
