Must-read series for learning Python crawlers: verification code processing and the use of coding platforms

Verification code processing

Learning objectives

  1. Understand the basics of verification codes

  2. Master the use of an image recognition engine

  3. Get to know common coding platforms

  4. Master how to process verification codes through a coding platform


1. Image verification codes

1.1 What is an image verification code

  • CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart". It is a fully automated public test that determines whether the user is a computer or a human.

1.2 The role of verification codes

  • Verification codes guard against malicious password cracking, vote rigging, forum spamming, and page flooding. In particular, they stop an attacker from using a program to brute-force the password of a specific registered user through repeated login attempts. Many websites use them (for example China Merchants Bank's online personal banking and Baidu communities) because they are a relatively simple way to provide this protection. Logging in becomes slightly more cumbersome, but the protection is necessary and important for users' password security.

1.3 Use scenarios of image verification codes in crawlers

  • Registration

  • Login

  • When requests are sent too frequently, the server pops up a verification code to check the client

1.4 Image verification code processing scheme

  • Manual input (input): only suitable when logging in once allows continued use afterwards

  • Image recognition engine: an optical character recognition engine processes the data in the picture. It is currently used mostly for extracting data from pictures and less often for processing verification codes

  • Coding platform: a service that identifies verification codes for you, and the solution crawlers most commonly use
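
The manual input option in the list above can be sketched as follows. This is only an illustration, not code from the original series: the function name `solve_manually`, the default file name `captcha.jpg`, and the `prompt` parameter are all assumptions. It saves the image bytes to disk so a person can open the file, then reads the answer that person types.

```python
from pathlib import Path


def solve_manually(image_bytes, save_path="captcha.jpg", prompt=input):
    """Save the verification code image and ask a person to type what it shows.

    `save_path` and `prompt` are hypothetical parameters for this sketch.
    `prompt` defaults to the built-in input(), so the call blocks until
    someone answers.
    """
    path = Path(save_path)
    path.write_bytes(image_bytes)  # write the image so it can be opened and read
    answer = prompt(f"Open {path} and type the verification code: ")
    return answer.strip()
```

Because a person has to answer every time, this approach only fits the "log in once, keep using the session" case described above.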

2. Image recognition engine

OCR (Optical Character Recognition) refers to software that takes image files of text, for example produced by a scanner or a digital camera, and analyzes them to automatically recognize and extract the text and its layout information.

2.1 What is tesseract

  • Tesseract is an open-source OCR engine originally developed by HP Labs and later maintained by Google. It is free and supports many languages and platforms.

  • Project address: https://github.com/tesseract-ocr/tesseract

2.2 Installation of image recognition engine environment

1 Engine installation

  • On macOS, run the command directly

brew install tesseract

  • On Windows, install from the exe installer; the download address can be found in the wiki of the GitHub project. After installation, remember to add the directory of the Tesseract executable to PATH so that later calls can find it.

  • On Linux (Debian/Ubuntu)

sudo apt-get install tesseract-ocr

2 Python library installation

# Pillow (imported as PIL) is used to open image files
pip/pip3 install Pillow
# pytesseract is used to extract data from the picture
pip/pip3 install pytesseract
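
After installation, a quick check like the one below can confirm that the pieces are in place. It is a minimal sketch using only the standard library; the function name `check_ocr_environment` is an assumption made for this example.

```python
import importlib.util
import shutil


def check_ocr_environment():
    """Report whether the Tesseract binary and the two Python libraries are available."""
    return {
        # shutil.which returns the executable's path, or None if it is not on PATH
        "tesseract": shutil.which("tesseract") is not None,
        # find_spec returns None when a top-level module is not installed
        "PIL": importlib.util.find_spec("PIL") is not None,
        "pytesseract": importlib.util.find_spec("pytesseract") is not None,
    }


print(check_ocr_environment())
```

If `tesseract` shows up as `False` on Windows, the executable's directory is most likely missing from PATH.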

2.3 Use of image recognition engine

The image_to_string method of the pytesseract module extracts the data in an opened image file as a string. It is used as follows:

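from PIL import Image
import pytesseract

# 'captcha.png' is a placeholder path; replace it with your own image file
im = Image.open('captcha.png')

# Run OCR on the image and get the recognized text as a string
result = pytesseract.image_to_string(im)

print(result)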
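
Verification code images often contain background noise, and the raw OCR output usually carries stray whitespace. The sketch below shows one common way to deal with both. It is an assumption layered on top of the example above, not part of the original series: the names `clean_ocr_text` and `recognize_captcha` and the threshold of 140 are chosen purely for illustration.

```python
import re


def clean_ocr_text(raw):
    """Keep only ASCII letters and digits from the OCR output."""
    return re.sub(r"[^0-9A-Za-z]", "", raw)


def recognize_captcha(path, threshold=140):
    """OCR a simple verification code image after converting it to black and white.

    The threshold of 140 is an assumption; tune it for each site.
    """
    # Imported here so clean_ocr_text can be used without the OCR stack installed
    from PIL import Image
    import pytesseract

    im = Image.open(path).convert("L")  # convert to grayscale
    im = im.point(lambda p: 255 if p > threshold else 0)  # binarize the pixels
    # --psm 7 asks Tesseract to treat the image as a single line of text
    raw = pytesseract.image_to_string(im, config="--psm 7")
    return clean_ocr_text(raw)
```

Even with this preprocessing, distorted or overlapping characters are frequently misread, which is why OCR is used less often for verification codes.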

2.4 Expansion of the use of image recognition engine

  • Microsoft Azure image recognition: https://azure.microsoft.com/zh-cn/services/cognitive-services/computer-vision/

  • Youdao Zhiyun character recognition: http://aidemo.youdao.com/ocrdemo

  • Alibaba Cloud text recognition: https://www.aliyun.com/product/cdi/

  • Tencent OCR text recognition: https://cloud.tencent.com/product/ocr

3. Coding platform

Let's take Yundama (云打码, "cloud coding") as an example to see how a coding platform is used.

3.1 Official interface of Yundama

The following code is provided by the Yundama platform. It has been modified slightly to implement two methods:

  1. indetify: identifies a picture passed in as the binary content of a response

  2. indetify_by_filepath: identifies a picture given its file path

The places that need to be configured by yourself are:

username = 'whoarewe' # username

password = '***' # password

appid = 4283 # appid

appkey = '02074c64f0d0bb9efb2df455537b01c3' # appkey

codetype = 1004 # verification code type
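
Hardcoding credentials in the script is convenient for a tutorial, but they can also be read from the environment. The following is a minimal sketch of that idea, not something the platform requires: the `YDM_*` variable names and the function `load_yundama_config` are hypothetical.

```python
import os


def load_yundama_config(env=os.environ):
    """Build the configuration values from environment variables.

    The YDM_* variable names are hypothetical; pick any names you like.
    """
    return {
        "username": env["YDM_USERNAME"],
        "password": env["YDM_PASSWORD"],
        "appid": int(env["YDM_APPID"]),
        "appkey": env["YDM_APPKEY"],
        # 1004 is the verification code type used in this article
        "codetype": int(env.get("YDM_CODETYPE", "1004")),
    }
```

A missing required variable raises `KeyError`, so a misconfigured environment is noticed before any request is sent.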

The official API code provided by Yundama is as follows:

# yundama.py
import requests
import json
import time

class YDMHttp:
    """Thin client for the Yundama HTTP API."""
    apiurl = 'http://api.yundama.com/api.php'
    username = ''
    password = ''
    appid = ''
    appkey = ''

    def __init__(self, username, password, appid, appkey):
        self.username = username
        self.password = password
        self.appid = str(appid)
        self.appkey = appkey

    def request(self, fields, files=None):
        # POST the form fields (and an optional file) and decode the JSON reply
        response = self.post_url(self.apiurl, fields, files)
        response = json.loads(response)
        return response

    def balance(self):
        data = {'method': 'balance', 'username': self.username, 'password': self.password, 'appid': self.appid,
                'appkey': self.appkey}
        response = self.request(data)
        if (response):
            if (response['ret'] and response['ret'] < 0):
                return response['ret']
            else:
                return response['balance']
        else:
            return -9001

    def login(self):
        data = {'method': 'login', 'username': self.username, 'password': self.password, 'appid': self.appid,
                'appkey': self.appkey}
        response = self.request(data)
        if (response):
            if (response['ret'] and response['ret'] < 0):
                return response['ret']
            else:
                return response['uid']
        else:
            return -9001

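    def upload(self, filename, codetype, timeout):
        # Despite its name, `filename` holds the image content (bytes), not a path:
        # requests sends whatever value is stored under 'file' as the file body.
        data = {'method': 'upload', 'username': self.username, 'password': self.password, 'appid': self.appid,
                'appkey': self.appkey, 'codetype': str(codetype), 'timeout': str(timeout)}
        file = {'file': filename}
        response = self.request(data, file)
        if (response):
            # A negative 'ret' is an error code; otherwise return the verification code ID
            if (response['ret'] and response['ret'] < 0):
                return response['ret']
            else:
                return response['cid']
        else:
            return -9001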

    def result(self, cid):
        data = {'method': 'result', 'username': self.username, 'password': self.password, 'appid': self.appid,
                'appkey': self.appkey, 'cid': str(cid)}
        response = self.request(data)
        return response and response['text'] or ''

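    def decode(self, filename, codetype, timeout):
        # Upload the image, then poll once per second for the recognized text
        cid = self.upload(filename, codetype, timeout)
        if (cid > 0):
            for i in range(0, timeout):
                result = self.result(cid)
                if (result != ''):
                    return cid, result
                else:
                    time.sleep(1)
            # No result arrived within `timeout` seconds
            return -3003, ''
        else:
            # The upload failed; cid holds the error code
            return cid, ''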

    def post_url(self, url, fields, files=None):
        # The values in `files` are sent as-is, so callers must pass the image
        # content (bytes) rather than a file path.
        res = requests.post(url, files=files, data=fields)
        return res.text

username = 'whoarewe' # username

password = '***' # password

appid = 4283 # appid

appkey = '02074c64f0d0bb9efb2df455537b01c3' # appkey

filename = 'getimage.jpg' # file location

codetype = 1004 # verification code type

# timeout in seconds
timeout = 60

def indetify(response_content):
    if (username == 'username'):
        print('Please set the parameters before testing')
    else:
        # Initialize the client
        yundama = YDMHttp(username, password, appid, appkey)

        # Log in to Yundama
        uid = yundama.login()
        print('uid: %s' % uid)

        # Query the account balance
        balance = yundama.balance()
        print('balance: %s' % balance)

        # Start recognition: image content, verification code type ID, timeout (seconds); returns the result
        cid, result = yundama.decode(response_content, codetype, timeout)
        print('cid: %s, result: %s' % (cid, result))
        return result

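def indetify_by_filepath(file_path):
    if (username == 'username'):
        print('Please set the parameters before testing')
    else:
        # Initialize the client
        yundama = YDMHttp(username, password, appid, appkey)

        # Log in to Yundama
        uid = yundama.login()
        print('uid: %s' % uid)

        # Query the account balance
        balance = yundama.balance()
        print('balance: %s' % balance)

        # decode() uploads its first argument as the file body, so read the image bytes first
        with open(file_path, 'rb') as f:
            image_content = f.read()

        # Start recognition: image content, verification code type ID, timeout (seconds); returns the result
        cid, result = yundama.decode(image_content, codetype, timeout)
        print('cid: %s, result: %s' % (cid, result))
        return result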

if __name__ == '__main__':
    pass
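
The decode method above follows an upload-then-poll pattern: it submits the image, then asks for the result once per second until text comes back or the timeout is reached. The same pattern can be sketched in isolation as below. The helper `poll_until` and its `sleep` parameter are assumptions introduced here for illustration and are not part of the platform's code.

```python
import time


def poll_until(fetch, timeout, interval=1.0, sleep=time.sleep):
    """Call fetch() until it returns a non-empty value or `timeout` seconds pass.

    Returns the value, or '' if nothing arrived in time.
    """
    attempts = max(1, int(timeout / interval))
    for _ in range(attempts):
        value = fetch()
        if value:
            return value
        sleep(interval)  # wait before asking again
    return ""
```

Passing `sleep` as a parameter keeps the waiting behavior replaceable, which makes the loop easy to exercise without real delays.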

4. Common types of verification codes

4.1 The URL stays the same and the verification code stays the same

This is the simplest type of verification code. You only need to obtain the address of the verification code image, request it, and identify it through the coding platform.
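
For this simple case the flow is just "download, then identify". The sketch below adds one assumption of its own: since the image never changes, the answer is cached so that the platform is only called once. The function `solve_fixed_captcha` and its parameters are hypothetical names for this example.

```python
def solve_fixed_captcha(captcha_url, fetch, recognize, cache):
    """Return the text of a verification code whose URL and image never change.

    fetch(url) -> bytes      downloads the image (for example with requests)
    recognize(bytes) -> str  identifies it (for example indetify from yundama.py)
    cache                    a dict, so each image is only identified once
    """
    if captcha_url not in cache:
        image_bytes = fetch(captcha_url)
        cache[captcha_url] = recognize(image_bytes)
    return cache[captcha_url]
```

In practice `fetch` could be `lambda url: requests.get(url).content` and `recognize` could be the `indetify` function from yundama.py.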

4.2 The URL stays the same but the verification code changes

This is the more common type. To handle it, you need to think about the following:

During login, assuming the verification code I enter is correct, how does the server know that it is the code displayed on my screen rather than some other code?

Between fetching the page, requesting the verification code, and submitting it, the server must have some way of confirming that the code I received earlier and the code I submit at the end are the same one. What is that way?

In most cases it is cookies: the server ties the verification code it generated to the cookies it issued. So when requesting the page, requesting the verification code, and submitting the verification code, you need to keep the cookies consistent. You can use requests.session() to solve this problem, because a session object sends the same cookies on every request.
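
The three steps can be sketched as one function that makes every request through the same session object. This is a minimal sketch under stated assumptions: the function name `login_with_captcha`, the URL parameters, and the form field name `captcha` are hypothetical and will differ from site to site.

```python
def login_with_captcha(session, page_url, captcha_url, login_url, form, recognize,
                       captcha_field="captcha"):
    """Load the page, fetch the verification code, and submit it on one session.

    `session` is expected to behave like requests.session(); `recognize`
    turns image bytes into text (for example indetify from yundama.py).
    """
    session.get(page_url)  # the server issues its cookies here
    # Same session, same cookies, so this code belongs to our visit
    image_bytes = session.get(captcha_url).content
    data = dict(form)  # copy so the caller's dict is not modified
    data[captcha_field] = recognize(image_bytes)
    return session.post(login_url, data=data)
```

Called with `requests.session()` as the first argument, all three requests share one cookie jar, which is exactly the consistency the server checks for.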


Summary

  1. Understand the basics of verification codes

  2. Master the use of an image recognition engine

  3. Get to know common coding platforms

  4. Master how to process verification codes through a coding platform


Origin blog.csdn.net/weixin_45293202/article/details/114580359