In the era of big data, many people use crawlers to collect data from the Internet, but many websites deploy anti-crawling strategies. When scraping, the login page is often the first hurdle: most sites now ask the user to fill in a verification code, which comes in many forms, such as static images, dynamic codes, slider puzzles, 12306-style picture selection, and even SMS codes. Image recognition can crack some of these, but if the site changes its verification scheme, the whole algorithm may have to be rebuilt, so brute-forcing the verification code is a thankless strategy. Instead, you can use Python's requests module to simulate a login, using the cookies held by a requests session to skip the login step entirely. The browser used for debugging in this article is Chrome; other browsers differ in the details, but the logic is the same.
Get request header information
First, copy the request header data out of the browser's developer tools (Network tab) and store it in a dictionary:
agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36"
headers = {
"HOST": "xxx.com",
"Referer": "http://xxx.com/Manager/Main.aspx",
"User-Agent": agent
}
Get cookies
This step is the key one: log in successfully once in the browser, then press F12 to open the developer tools and find the cookie data for the logged-in session under Application > Cookies.
The values in the screenshot are redacted for privacy. Copy each Name and Value pair out and store them in a dictionary:
cookies = {
'xxx_cookie_time':'2020-04-28+10%3a59%3a19',
'xxx_cookie_language': 'zh_CN',
'ASP.NET_SessionId': 'v0vszqppwpxxxxxxxx',
'ValidCode':'OicQ%2bxxxx',
'xxx_session_id':'FUl0%2b4kCmyEyxxxxxxxxxx',
'_ati':'1733720xxxx'
}
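Alternatively, instead of copying each pair by hand, you can copy the entire raw Cookie header string from the Network tab and parse it. The helper below is my own addition, not part of the original workflow:

```python
def cookie_string_to_dict(raw: str) -> dict:
    # Parse a raw "Cookie:" header string copied from DevTools
    # ("name1=value1; name2=value2; ...") into a dict.
    cookies = {}
    for pair in raw.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep:  # skip fragments without an '=' sign
            cookies[name] = value
    return cookies

raw = "ASP.NET_SessionId=v0vszq; xxx_cookie_language=zh_CN"
print(cookie_string_to_dict(raw))
# {'ASP.NET_SessionId': 'v0vszq', 'xxx_cookie_language': 'zh_CN'}
```

Note that `partition("=")` splits on the first `=` only, so values containing `=` survive intact.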
Create session and implement page login
Next, create a session object and assign the headers and cookies to it:
import requests
session = requests.Session()
session.headers = headers
requests.utils.add_dict_to_cookiejar(session.cookies, cookies)
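As a quick sanity check (a minimal sketch with dummy cookie values, not the real site's data), you can confirm the dict has landed in the session's cookie jar:

```python
import requests

session = requests.Session()
cookies = {"ASP.NET_SessionId": "v0vszq", "lang": "zh_CN"}  # dummy values
requests.utils.add_dict_to_cookiejar(session.cookies, cookies)

# The session stores cookies in a RequestsCookieJar, not a plain dict
print(type(session.cookies).__name__)            # RequestsCookieJar
print(session.cookies.get("ASP.NET_SessionId"))  # v0vszq
```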
It is worth noting that session.headers can be a plain dict, so direct assignment works, but session.cookies must be a requests.cookies.RequestsCookieJar, so requests.utils.add_dict_to_cookiejar is used to merge the dict into the jar. Now that we hold a logged-in session, we can visit any page of the site normally:
url='http://xxx.com/Order/CodeByOrder.aspx?OrderCode=xxxxxx'
response = session.get(url)
>>> response
<Response [200]>
Now we can successfully obtain the page source code, and then we can analyze and extract the data we need.
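Before parsing, it helps to confirm the cookies actually worked: expired cookies usually produce a redirect back to the login page rather than an error, so a 200 alone does not prove you are logged in. A hypothetical check (the /Login.aspx path is an assumption; adjust it for your site):

```python
def looks_logged_out(status_code: int, final_url: str,
                     login_path: str = "/Login.aspx") -> bool:
    # Hypothetical heuristic: expired sessions are typically answered
    # with 401/403, or redirected back to the login page.
    if status_code in (401, 403):
        return True
    return login_path.lower() in final_url.lower()

# usage with the session above:
# response = session.get(url)
# if looks_logged_out(response.status_code, response.url):
#     print("cookies expired; log in again and copy fresh cookies")
```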
If you want to send an ad-hoc GET/POST request outside the session, you can also pass the cached cookies back to requests as a parameter (the cookies argument accepts either a dict or a CookieJar), which lets you perform operations that require a login:
cookie_jar = requests.utils.add_dict_to_cookiejar(session.cookies, cookies)  # returns a RequestsCookieJar, not a dict
resget = requests.get(url, headers=headers, cookies=cookie_jar)    # GET
respost = requests.post(url, headers=headers, cookies=cookie_jar)  # POST
print(resget.text)
print(respost.text)
Note: with this cookie-based simulated login, once the copied cookies expire, you have to log in again in the browser and copy out fresh cookies.
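To avoid re-pasting the cookies on every run, you can cache them to disk and reload them at startup. The file name and both helpers below are my own sketch, not part of requests:

```python
import json
import os

import requests


def save_cookies(session: requests.Session, path: str = "cookies.json") -> None:
    # Serialize the session's cookie jar to a plain JSON dict
    with open(path, "w") as f:
        json.dump(requests.utils.dict_from_cookiejar(session.cookies), f)


def load_cookies(session: requests.Session, path: str = "cookies.json") -> bool:
    # Merge cached cookies back into the session; False if there is no cache
    if not os.path.exists(path):
        return False
    with open(path) as f:
        requests.utils.add_dict_to_cookiejar(session.cookies, json.load(f))
    return True
```

At startup, call load_cookies first; only if it returns False (or the site still redirects you to the login page) do you need to copy cookies from the browser again and call save_cookies.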
Complete code
import requests
agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36"
headers = {
"HOST": "xxx.com",
"Referer": "http://xxx.com/Manager/Main.aspx",
"User-Agent": agent
}
cookies = {
'xxx_cookie_time':'2020-04-28xxxx',
'xxx_cookie_language': 'zh_CN',
'ASP.NET_SessionId': 'v0vszqppwpxxxxx',
'ValidCode':'OicQ%2b2xxxx',
'xxx_session_id':'FUl0%2b4kCmyEyxxxxxxxxxx',
'_ati':'1733720xxxx'
}
session = requests.Session()
session.headers = headers
requests.utils.add_dict_to_cookiejar(session.cookies, cookies)
url='http://xxx.com/Order/CodeByOrder.aspx?OrderCode=xxxxxx'
response = session.get(url)
print(response)  # <Response [200]> if the cookies are still valid