Handling Common CAPTCHAs in Web Crawlers

Crawlers and CAPTCHAs have been at war for as long as crawlers have existed. Here are some notes from my experience handling simple CAPTCHAs:

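I. Login CAPTCHAs

Many websites require a username, a password, and an image CAPTCHA to log in. Simple image CAPTCHAs can be recognized with OCR (optical character recognition); more complex ones need extra processing.

Step 1: Fetch the CAPTCHA image and save it to a file

Method: use webdriver's screenshot feature to capture the CAPTCHA image. The code is as follows: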

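import time

from PIL import Image
from selenium import webdriver
from selenium.webdriver.common.by import By


def login():
    driver = webdriver.Chrome()
    driver.maximize_window()  # maximize the browser window
    driver.get('<login URL>')  # placeholder: replace with the login page URL
    time.sleep(10)  # wait for the page to finish loading
    driver.save_screenshot('printscreen.png')
    # Locate the CAPTCHA image (find_element_by_css_selector was removed in Selenium 4)
    imgelement = driver.find_element(
        By.CSS_SELECTOR,
        "#wrap > div.sign-wrap > div.sign-form.sign-sms > form > div.form-row.row-code > img")
    location = imgelement.location  # x, y coordinates of the CAPTCHA
    size = imgelement.size  # width and height of the CAPTCHA
    box = (int(location['x']), int(location['y']), int(location['x'] + size['width']),
           int(location['y'] + size['height']))  # (left, upper, right, lower) region to crop
    screenshot = Image.open("printscreen.png")  # open the full screenshot
    captcha_img = screenshot.crop(box)  # crop the CAPTCHA region out of the screenshot
    captcha_img.save('save.png')  # save the CAPTCHA image for recognition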
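
One thing that can go wrong with the crop above: on a high-DPI display the screenshot may have more pixels than the page has CSS pixels, so the element coordinates no longer line up. The following is a minimal sketch of computing the crop box with a scale factor. The helper name `crop_box` and the `scale` parameter are illustrative assumptions, not part of Selenium or PIL.

```python
def crop_box(location, size, scale=1.0):
    """Convert an element's position and size into a (left, upper, right, lower) box.

    `scale` is the assumed ratio of screenshot pixels to CSS pixels:
    1.0 on a standard display, often 2.0 on a high-DPI display.
    """
    left = int(location['x'] * scale)
    upper = int(location['y'] * scale)
    right = int((location['x'] + size['width']) * scale)
    lower = int((location['y'] + size['height']) * scale)
    return (left, upper, right, lower)


# Example values in the same shape as imgelement.location and imgelement.size.
location = {'x': 100, 'y': 50}
size = {'width': 80, 'height': 30}

box_standard = crop_box(location, size)
box_hidpi = crop_box(location, size, scale=2.0)
print(box_standard, box_hidpi)
```

The resulting tuple can be passed straight to `Image.crop`. If the cropped image looks shifted or too small, the scale factor is the first thing to check.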

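Step 2: Process the CAPTCHA

Method: once you have the CAPTCHA image, simple ones can be recognized locally, while complex ones can be sent to a human-powered CAPTCHA-solving platform or recognized with a deep-learning model.

For a simple CAPTCHA (here an image saved as 9952.png), pytesseract is enough: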

import pytesseract
from PIL import Image

img = Image.open('9952.png')
res = pytesseract.image_to_string(img)
print(res)

Output:

C:\Python\Python36\python.exe E:/desktop_file/maimai_register/clawerImgs/tress.py
9952
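
When a CAPTCHA has a tinted background or light noise, OCR often does better on a cleaned-up image. Below is a rough sketch of one common preprocessing step, grayscale conversion followed by thresholding, using PIL only. The function name `binarize` and the threshold of 140 are assumptions chosen for illustration; a real image will likely need its own threshold.

```python
from PIL import Image


def binarize(img, threshold=140):
    """Convert to grayscale, then map every pixel to pure black or pure white."""
    gray = img.convert('L')
    return gray.point(lambda p: 255 if p > threshold else 0)


# Synthetic 4x1 image: light-gray background with one dark "stroke" pixel.
sample = Image.new('RGB', (4, 1), (200, 200, 200))
sample.putpixel((1, 0), (30, 30, 30))

clean = binarize(sample)
pixels = [clean.getpixel((x, 0)) for x in range(4)]
print(pixels)
```

The binarized image can then be passed to `pytesseract.image_to_string` in place of the raw one.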

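II. CAPTCHAs triggered while crawling

This section deals with CAPTCHAs that appear because the crawler is requesting pages too quickly.

Sites usually handle this kind of CAPTCHA with a redirect:

Visit page A --- CAPTCHA pops up --- fetch the CAPTCHA --- submit the CAPTCHA --- redirect back to page A

Approach:

1. Record page A's URL and request information

2. Fetch the CAPTCHA image with a GET request

3. Save the CAPTCHA image and recognize its content with a solving tool (here, a human-powered solving platform)

4. Submit the CAPTCHA text to the server and pass verification

5. Get redirected to the page at A's URL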
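
The steps above can be sketched as a small retry loop that is independent of any particular site. This is only an outline under stated assumptions: `fetch_image`, `recognize` and `submit` are hypothetical callables standing in for the real network requests and the solving platform, and the stand-in functions in the usage part make no network calls.

```python
def solve_captcha_flow(fetch_image, recognize, submit, max_attempts=3):
    """Run fetch -> recognize -> submit until the server accepts the answer.

    All three callables are hypothetical placeholders:
      fetch_image() -> bytes     downloads a fresh CAPTCHA image
      recognize(bytes) -> str    returns the recognized text ('' on failure)
      submit(str) -> bool        posts the answer, True if it was accepted
    Returns the number of attempts used, or None if every attempt failed.
    """
    for attempt in range(1, max_attempts + 1):
        image = fetch_image()
        answer = recognize(image)
        if not answer:
            continue  # recognition failed; fetch a new image and try again
        if submit(answer):
            return attempt
    return None


# Usage with stand-in functions instead of real requests.
answers = iter(['', 'abcd', '9952'])
result = solve_captcha_flow(
    fetch_image=lambda: b'fake-image-bytes',
    recognize=lambda image: next(answers),
    submit=lambda text: text == '9952',
)
print(result)
```

Bounding the number of attempts matters here because every attempt costs a request, and possibly a paid recognition call.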

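A handy tool for this: https://curl.trillworks.com/ converts a curl command copied from the browser into Python requests code. Worth bookmarking.

An example: the CAPTCHA flow for one website. Only the core code is shown; the class definition and the imports (os, time, requests) are omitted.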

    def get_capture(self):
        """Fetch the CAPTCHA image."""
        self.randomkey = str(int(1000*time.time()))
        headers = {
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh;q=0.9',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3409.2 Safari/537.36',
            'accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
            'referer': 'https://www.zhipin.com/captcha/popUpCaptcha?redirect={}'.format(self.url),
            'authority': 'www.zhipin.com',
            'cookie': self.cookies
        }

        params = (
            ('randomKey', self.randomkey ),
        )

        response = requests.get('https://www.zhipin.com/captcha', headers=headers, params=params)
        return response.content


    def pass_capture(self, capture_res):
        """Submit the CAPTCHA text to the server."""
        headers = {
            'authority': 'www.zhipin.com',
            'cache-control': 'max-age=0',
            'origin': 'https://www.zhipin.com',
            'upgrade-insecure-requests': '1',
            'content-type': 'application/x-www-form-urlencoded',
            'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3409.2 Safari/537.36',
            'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            # 'referer': 'https://www.zhipin.com/captcha/popUpCaptcha?redirect=https%3A%2F%2Fwww.zhipin.com%2Fboss%2Fsearch%2Fgeek%2Finfo%3Fsuid%3D1fc97ddf37d119f4KGQMV4z1idI%7E%26jid%3D0%26lid%3D12659U92LQM.lookupsearchgeek.1%26expectId%3D2726533%26segs%3Djava',
            'referer': self.url,
            'accept-encoding': 'gzip, deflate, br',
            'accept-language': 'zh-CN,zh;q=0.9',
            'cookie': self.cookies
        }

        params = (
            ('redirect', self.url),)
        # capture = input("please input captcha")
        data = [
            ('randomKey', self.randomkey),
            ('captcha', capture_res),
        ]
        response = requests.post('https://www.zhipin.com/captcha/verifyCaptcha', headers=headers, params=params,
                                 data=data)
        print(response.url)
        print(response.text)
        return response

    def verify_code(self, filename):
        """Send the image to the CAPTCHA-solving platform and return the recognized text."""
        l = self.logger
        retry = 0
        while True:
            retry += 1
            if retry == 3:
                return ''  # give up after two failed attempts
            try:
                captcha_id, vycode = self.dama2.decode_captcha(captcha_type=3040, file_path=filename)
                return vycode
            except Exception as e:
                l.info(e)
                l.info('verify error')

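    def run(self, url):
        """Fetch, solve and submit the CAPTCHA until the server accepts it (at most 10 tries)."""
        self.url = url  # assumed: the page that triggered the CAPTCHA, used for the redirect
        retry = 0
        while True:
            if retry == 10:
                return False
            capture = self.get_capture()
            filename = os.path.join(os.getcwd(), "captcha.png")
            with open(filename, "wb") as f:
                f.write(capture)
            capture_res = self.verify_code(filename)
            if not capture_res:
                print('Failed to recognize the CAPTCHA')
                retry += 1  # count failed recognitions too, so the loop always terminates
                continue
            print('CAPTCHA text: {}'.format(capture_res))
            res = self.pass_capture(capture_res)
            print("CAPTCHA submitted: {}".format(res.text))
            time.sleep(20)
            # If the response still contains this prompt, the CAPTCHA was rejected.
            if '<div class="tips">为了您的账号安全，我们需要在执行操作之前验证您的身份，请输入验证码。</div>' not in res.text:
                print("CAPTCHA cleared")
                return True
            retry += 1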
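
One detail in get_capture is worth a second look: the referer is built by formatting self.url directly into the query string, while the commented-out referer in pass_capture shows the redirect target percent-encoded. Below is a small sketch of building that URL with the standard library. The helper name `build_popup_referer` and the target URL are made up for illustration, and whether the server actually requires the encoded form is an assumption.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

CAPTCHA_POPUP = 'https://www.zhipin.com/captcha/popUpCaptcha'


def build_popup_referer(redirect_url):
    """Return the popup URL with redirect_url percent-encoded in the query string."""
    return '{}?{}'.format(CAPTCHA_POPUP, urlencode({'redirect': redirect_url}))


# Hypothetical target URL that carries its own query string.
target = 'https://www.zhipin.com/boss/search?jid=0&segs=java'
referer = build_popup_referer(target)
print(referer)

# Decoding the query recovers the original URL unchanged.
recovered = parse_qs(urlsplit(referer).query)['redirect'][0]
```

Without the encoding, the `&segs=java` part of the target would be read as a second parameter of the popup URL rather than as part of the redirect.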

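III. Reassembling a sliced CAPTCHA image

On some sites, the CAPTCHA image displayed on the page looks like this:

But the developer tools reveal that the underlying background image actually looks like this:

The displayed image is clearly assembled from slices of this background image, and the positioning coordinates can be found in the JS:

Sure enough, the page displays the background image slice by slice, so to restore the original image we have to reverse that process.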
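
If the slice coordinates end up in the page as CSS background-position values, they can be pulled out with a regular expression. This is a sketch under a loud assumption: the inline style format below is invented for illustration, and the real page may store the offsets in a different form. The name `parse_offsets` is likewise hypothetical.

```python
import re

# Assumed format of the inline styles; adjust the pattern to the real page.
POSITION_RE = re.compile(r'background-position:\s*(-?\d+)px\s+(-?\d+)px')


def parse_offsets(styles):
    """Extract (x, y) slice offsets from a list of inline style strings."""
    offsets = []
    for style in styles:
        match = POSITION_RE.search(style)
        if match:
            # A negative background-position shifts the image left/up, so the
            # slice's position inside the image is the absolute value.
            x, y = int(match.group(1)), int(match.group(2))
            offsets.append((abs(x), abs(y)))
    return offsets


styles = [
    'background-position: -66px -40px;',
    'background-position: -286px -40px;',
    'background-position: 0px -98px;',
    'color: red;',  # no position here, so it is skipped
]
offsets = parse_offsets(styles)
print(offsets)
```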

The code below sketches the approach:

from PIL import Image

# Large tiles are 21.98 wide and 58 high
offset_list = [['66', '40'], ['286', '40'], ['66', '98'], ['44', '40'], ['154', '40'], ['22', '40'], ['88', '98'],
               ['198', '40'], ['198', '98'], ['264', '98'], ['308', '40'], ['176', '40'], ['0', '98'], ['132', '98'],
               ['132', '40'], ['176', '98'], ['88', '40'], ['154', '98'], ['220', '40'], ['264', '40'], ['110', '40'],
               ['242', '98'], ['286', '98'], ['0', '40'], ['242', '40'], ['44', '98'], ['220', '98'], ['22', '98'],
               ['308', '98'], ['110', '98']]

# Small tiles are 21.98 wide and 40 high
offset_list_small = [['264', '0'], ['154', '0'], ['44', '0'], ['242', '0'], ['110', '0'], ['176', '0'], ['88', '0']]

# Get the paste position of each tile in the reassembled image
def convert_index_to_offset(index, size):
    if size == 1:
        if index < 15:  # the full CAPTCHA consists of 30 tiles in 2 rows of 15
            return (index * 22, 0)
        else:
            i = index - 15
            return (i * 22, 58)  # each tile is 22 * 58
    elif size == 2:
        return (index * 22, 116)


# Get the crop box of each tile in the source image
def convert_css_to_offset(off, size):
    # (left, upper)o ----- o
    #         |       |
    #         o ----- o(right, lower)
    if size == 1:
        return (int(off[0]), int(off[1]), int(off[0]) + 21.98, int(off[1]) + 58)
    elif size == 2:
        return (int(off[0]), int(off[1]), int(off[0]) + 21.98, int(off[1]) + 40)

# 289,92;256,58;131,43;91,22
# Reassemble the image
def recombine_captcha(file_id):
    captcha = Image.new('RGB', (22 * 15, 58 * 2 + 40))  # create a blank canvas
    img = Image.open('./capture/capture1_{}.png'.format(file_id))  # open the source image
    for i, off in enumerate(offset_list):
        box = convert_css_to_offset(off, 1)  # crop box of the tile, from the CSS background-position
        region = img.crop(box)  # cut the tile out
        offset = convert_index_to_offset(i, 1)  # where this tile belongs on the canvas
        captcha.paste(region, offset)  # paste the tile at that position
    for i, off in enumerate(offset_list_small):
        box = convert_css_to_offset(off, 2)  # crop box of the tile, from the CSS background-position
        region = img.crop(box)  # cut the tile out
        offset = convert_index_to_offset(i, 2)  # where this tile belongs on the canvas
        captcha.paste(region, offset)  # paste the tile at that position
    capture_2 = Image.open('./capture/text.png')
    captcha.paste(capture_2, (154, 116))

    captcha.save('./capture/capture2_{}.png'.format(file_id))

if __name__ == '__main__':
    recombine_captcha('-5543')

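The left image is the original as shown on the page and the right image is the reassembled result. That completes the image processing; the next step is to send the image to a solving platform to get the coordinates of the characters, then submit them to the server for verification.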
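
To check the reassembly logic without the real CAPTCHA files, the same crop-and-paste idea can be run on a synthetic image. The sketch below scrambles four colored tiles and restores them from a list of offsets. The tile colors, the `slots` permutation and the function names are all made up for the demonstration; only the 22 by 58 tile size comes from the code above.

```python
from PIL import Image

TILE_W, TILE_H = 22, 58
COLORS = ['red', 'green', 'blue', 'yellow']  # one distinct color per tile


def make_original():
    """Build a one-row image made of solid-color tiles."""
    img = Image.new('RGB', (TILE_W * len(COLORS), TILE_H))
    for i, color in enumerate(COLORS):
        img.paste(Image.new('RGB', (TILE_W, TILE_H), color), (i * TILE_W, 0))
    return img


def scramble(original, slots):
    """Store tile i of the original at slot slots[i] of the scrambled image."""
    out = Image.new('RGB', original.size)
    for i, slot in enumerate(slots):
        tile = original.crop((i * TILE_W, 0, (i + 1) * TILE_W, TILE_H))
        out.paste(tile, (slot * TILE_W, 0))
    return out


def recombine(scrambled, offsets):
    """offsets[i] is the x position, inside the scrambled image, of display tile i."""
    out = Image.new('RGB', scrambled.size)
    for i, x in enumerate(offsets):
        tile = scrambled.crop((x, 0, x + TILE_W, TILE_H))
        out.paste(tile, (i * TILE_W, 0))
    return out


slots = [2, 0, 3, 1]
original = make_original()
scrambled = scramble(original, slots)
restored = recombine(scrambled, [slot * TILE_W for slot in slots])
print(restored.tobytes() == original.tobytes())
```

The offsets list plays the same role as offset_list in recombine_captcha: entry i says where in the scrambled image the i-th displayed tile lives.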