Python crawler selections, part 12 (CAPTCHA anti-crawling techniques)

The most annoying verification codes in the world

  • Honestly, I can't solve these either

1. Image verification code

1.1 What is an image verification code

  • CAPTCHA stands for "Completely Automated Public Turing test to tell Computers and Humans Apart": a public, fully automated test that determines whether the user is a computer or a human.

1.2 The role of verification code

  • Verification codes prevent malicious password cracking, vote rigging, forum spamming, and page-refresh flooding. In particular, they stop an attacker from using a program to brute-force the password of a specific registered user through continuous login attempts. Many websites use them (for example China Merchants Bank's online personal banking and the Baidu community) because they are a relatively simple way to achieve this protection. Logging in becomes slightly more tedious, but the feature is necessary and important for users' password security.

1.3 Use scenarios of image verification codes in crawlers

  • Registration
  • Login
  • When requests are sent too frequently, the server responds with a verification code challenge

1.4 Image verification code processing scheme

  • Manual input (input)
    Only practical when a single login can be reused for a long time afterwards
  • Image recognition engine analysis
    Use an optical character recognition engine to process the data in the image. Today this is mostly used to extract data from images and less often to solve verification codes
  • Coding platform (a third-party service that recognizes verification codes for you)
    The usual verification code solution for crawlers
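
The manual-input option can be sketched as follows. This is a minimal illustration, not a definitive implementation: the helper name ask_human, the default file name, and the injectable prompt function are all assumptions made for this example, and the image bytes are expected to come from whatever request fetched the verification code.

```python
from pathlib import Path


def ask_human(image_bytes, path="captcha.png", prompt=input):
    """Save the verification code image and ask a person to type it in.

    `path` and `prompt` are illustrative parameters, not part of any library.
    """
    # Write the raw image bytes to disk so the user can open the file.
    Path(path).write_bytes(image_bytes)
    # Block until the user types what they see, then strip stray whitespace.
    answer = prompt(f"Open {path} and type the verification code: ")
    return answer.strip()
```

In a crawler, image_bytes would typically be the body of the response that returned the verification code image. Because a person has to be present for every call, this only pays off when the resulting login can be reused.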

2. Image recognition engine

OCR (Optical Character Recognition) refers to software that takes text captured as image files with a scanner or digital camera, analyzes those image files, and automatically recognizes and extracts the text and layout information.

2.1 What is tesseract

  • Tesseract is an OCR engine originally developed at HP Labs and later maintained by Google. It is open source, free, and supports many languages and platforms.
  • Project address: https://github.com/tesseract-ocr/tesseract

2.2 Installation of image recognition engine environment

1 Engine installation

  • On macOS, run the command directly
brew install tesseract
  • Installation on Windows
    Install from the exe installer; the download address is linked from the wiki of the GitHub project. After installation, add the directory containing the Tesseract executable to PATH so that later calls can find it.

  • Installation on Linux (Debian/Ubuntu)

sudo apt-get install tesseract-ocr
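
After installing, it can help to confirm that the engine is actually reachable from Python. The sketch below uses only the standard library; the function names find_tesseract and tesseract_version are made up for this example, and it assumes the executable is called tesseract.

```python
import shutil
import subprocess


def find_tesseract():
    """Return the full path of the tesseract executable, or None if it is not on PATH."""
    return shutil.which("tesseract")


def tesseract_version(executable):
    """Return the first line printed by `tesseract --version` (empty string if none)."""
    completed = subprocess.run(
        [executable, "--version"],
        capture_output=True,
        text=True,
        check=True,
    )
    # Some builds print the version banner to stderr instead of stdout.
    lines = (completed.stdout or completed.stderr).splitlines()
    return lines[0] if lines else ""
```

If find_tesseract() returns None, the PATH step was probably missed. As an alternative to changing PATH, pytesseract also lets you point it at the executable by assigning the full path to pytesseract.pytesseract.tesseract_cmd.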

2 Python library installation

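# Pillow (PIL) is used to open image files
pip install pillow        # or: pip3 install pillow

# The pytesseract module is used to extract text from images
pip install pytesseract   # or: pip3 install pytesseract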

2.3 Use of image recognition engine

  • The image_to_string method of the pytesseract module extracts the text in an opened image file as a string. Usage is as follows:
from PIL import Image
import pytesseract

# "captcha.png" is a placeholder; pass the path of your own image file
im = Image.open("captcha.png")

# Run OCR on the image and return the recognized text as a string
result = pytesseract.image_to_string(im)

print(result)
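
Verification code images are usually noisy, so raw OCR often misreads them. A common mitigation, sketched here under assumptions, is to binarize the image first and tell Tesseract that it contains a single line drawn from a known character set. The helper names, the threshold of 140, and the character set are illustrative choices, and whether the whitelist is honored depends on the installed Tesseract version.

```python
def build_tesseract_config(psm=7, whitelist=None):
    """Build the extra command-line options passed to Tesseract.

    --psm 7 asks Tesseract to treat the image as a single line of text.
    """
    parts = [f"--psm {psm}"]
    if whitelist:
        # Restrict recognition to the given characters.
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def read_code(path, threshold=140):
    """Binarize the image at `path` and run OCR on it."""
    # Imported here so build_tesseract_config works without the OCR stack.
    from PIL import Image
    import pytesseract

    gray = Image.open(path).convert("L")  # grayscale, pixel values 0..255
    # Map every pixel to pure black or pure white to suppress background noise.
    binary = gray.point(lambda p: 255 if p > threshold else 0)
    config = build_tesseract_config(
        whitelist="0123456789abcdefghijklmnopqrstuvwxyz"
    )
    return pytesseract.image_to_string(binary, config=config).strip()
```

Even with this preprocessing, distorted or overlapping characters tend to defeat OCR, which is why crawlers more often fall back on a coding platform.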

2.4 Other image recognition services

    Microsoft Azure image recognition: https://azure.microsoft.com/zh-cn/services/cognitive-services/computer-vision/
    Youdao Zhiyun text recognition: http://aidemo.youdao.com/ocrdemo
    Alibaba Cloud image and text recognition: https://www.aliyun.com/product/cdi/
    Tencent OCR text recognition: https://cloud.tencent.com/product/ocr

3. Coding platform

3.1 Use of coding platform

Many websites now use verification codes as an anti-crawling measure, so to collect data reliably you need to know how to use a coding platform (a third-party service that recognizes verification codes for you) from within a crawler.

3.2 Common coding platforms

  1. Chaojiying (超级鹰): http://www.chaojiying.com/api.html

  2. Tujian (图鉴): http://www.ttshitu.com/docs/index.html

    Can recognize general-purpose verification codes
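
Coding platforms generally work by accepting an uploaded image together with account credentials and returning the recognized text. The sketch below only shows the shape of such a request. Every field name and the code_type value are hypothetical and do not correspond to the API of either platform listed above; the real parameters are defined in each platform's documentation.

```python
import base64
import json


def build_recognition_request(image_bytes, username, password, code_type="1001"):
    """Build a JSON body for a hypothetical recognition endpoint.

    All field names here are made up for illustration; each real platform
    defines its own in its API documentation.
    """
    payload = {
        "username": username,
        "password": password,
        "type": code_type,
        # Binary image data is base64-encoded so it can travel inside JSON.
        "image": base64.b64encode(image_bytes).decode("ascii"),
    }
    return json.dumps(payload)
```

The resulting string would then be sent in a POST request to the platform's recognition URL, and the recognized text read from the response.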

3.3 Using Yundama (cloud coding)

Let's take the Yundama (cloud coding) platform as an example to see how a coding platform is used

4. Common types of verification codes

4.1 The URL stays the same and the verification code stays the same

This is the simplest type of verification code. You only need to obtain the URL of the verification code image, request it, and have the coding platform recognize it.
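
Because the image behind the URL never changes in this case, one download and one recognition are enough. A minimal sketch, assuming the caller supplies a fetch function (URL to bytes) and a recognize function (bytes to text); both names and the dict cache are illustrative:

```python
def solve_static_code(url, fetch, recognize, cache):
    """Return the text for the verification code image at `url`.

    fetch(url) -> bytes and recognize(image_bytes) -> str are supplied by
    the caller; cache is any dict-like object.
    """
    if url not in cache:
        # The image never changes, so download and recognize it only once.
        cache[url] = recognize(fetch(url))
    return cache[url]
```

Here fetch could wrap an HTTP GET and recognize could wrap a call to a coding platform; caching the answer avoids paying for the same recognition twice.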

4.2 The URL stays the same but the verification code changes

This is the more common type. To handle it, consider the following question:

During login, suppose the verification code I enter is correct. How does the server know that the code I entered is the one displayed on my screen, rather than some other code?

Across the three steps (loading the page, requesting the verification code, and submitting the verification code), the server must have some way to confirm that the code I fetched earlier and the code I submit at the end are the same one. What is that mechanism?

It is usually done with cookies: the server ties the expected answer to the session identified by the cookie. So the same cookies must be sent when requesting the page, requesting the verification code, and submitting the verification code. A requests.Session object (created with requests.session() or requests.Session()) keeps them consistent across requests.
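
The cookie consistency described above can be sketched as a three-step flow on a single session. This is a sketch under stated assumptions: the function name, the URLs passed in, and the "captcha" form field are hypothetical, and recognize stands for whichever recognition method you use (manual input, OCR, or a coding platform).

```python
def login_with_code(session, recognize, login_page, code_url, submit_url, form):
    """Run the three login steps on one session so cookies stay consistent.

    session:   a requests.Session (or any object with get/post methods)
    recognize: callable turning image bytes into text
    The URLs and the "captcha" form field name are hypothetical.
    """
    # 1. Load the login page; the server usually sets its session cookie here.
    session.get(login_page)
    # 2. Fetch the verification code image with the same cookies.
    image_bytes = session.get(code_url).content
    # 3. Submit the form, again through the same session.
    data = dict(form, captcha=recognize(image_bytes))
    return session.post(submit_url, data=data)
```

With the requests library, session would be created once with requests.Session() and passed in. Calling requests.get and requests.post directly instead would send each request without the earlier cookies, so the server could not match the submitted code to the image it generated.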

Origin blog.csdn.net/weixin_38640052/article/details/108310602