In the era of big data, many people use crawlers to collect data from the Internet, but many websites deploy anti-crawling strategies. When scraping, the login page is often the first hurdle: most sites now ask the user to fill in a verification code, which comes in many forms, such as static images, dynamic codes, slider puzzles, 12306-style picture selection, and even SMS codes. Image recognition can crack some of these, but if the site changes its verification scheme, the whole algorithm may have to be rebuilt, so brute-forcing the verification code is a thankless strategy. Instead, you can use Python's requests module to simulate a login, using the cookies held by a requests session to skip the login step entirely. The browser used for debugging in this article is Chrome; other browsers differ in the details, but the logic is the same.
Get request header information
First, copy the request header data out of the browser's developer tools (Network tab) and store it in a dictionary:
agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36"
headers = {
"HOST": "xxx.com",
"Referer": "http://xxx.com/Manager/Main.aspx",
"User-Agent": agent
}
Get cookies
This step is the key one: log in successfully once in the browser, then press F12 to open the developer tools and find the cookie data for the logged-in session under Application > Cookies.
The values in the screenshot are redacted for privacy. Copy each Name and Value pair out and store them in a dictionary:
cookies = {
'xxx_cookie_time':'2020-04-28+10%3a59%3a19',
'xxx_cookie_language': 'zh_CN',
'ASP.NET_SessionId': 'v0vszqppwpxxxxxxxx',
'ValidCode':'OicQ%2bxxxx',
'xxx_session_id':'FUl0%2b4kCmyEyxxxxxxxxxx',
'_ati':'1733720xxxx'
}
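Alternatively, instead of copying each pair by hand, you can copy the entire raw Cookie header string from the Network tab and parse it. The helper below is my own addition, not part of the original workflow:

```python
def cookie_string_to_dict(raw: str) -> dict:
    # Parse a raw "Cookie:" header string copied from DevTools
    # ("name1=value1; name2=value2; ...") into a dict.
    cookies = {}
    for pair in raw.split(";"):
        name, sep, value = pair.strip().partition("=")
        if sep:  # skip fragments without an '=' sign
            cookies[name] = value
    return cookies

raw = "ASP.NET_SessionId=v0vszq; xxx_cookie_language=zh_CN"
print(cookie_string_to_dict(raw))
# {'ASP.NET_SessionId': 'v0vszq', 'xxx_cookie_language': 'zh_CN'}
```

Note that `partition("=")` splits on the first `=` only, so values containing `=` survive intact.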
Create session and implement page login
Next, create a session object and assign the headers and cookies to it:
import requests
session = requests.Session()
session.headers = headers
requests.utils.add_dict_to_cookiejar(session.cookies, cookies)
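As a quick sanity check (a minimal sketch with dummy cookie values, not the real site's data), you can confirm the dict has landed in the session's cookie jar:

```python
import requests

session = requests.Session()
cookies = {"ASP.NET_SessionId": "v0vszq", "lang": "zh_CN"}  # dummy values
requests.utils.add_dict_to_cookiejar(session.cookies, cookies)

# The session stores cookies in a RequestsCookieJar, not a plain dict
print(type(session.cookies).__name__)            # RequestsCookieJar
print(session.cookies.get("ASP.NET_SessionId"))  # v0vszq
```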
It is worth noting that session.headers can be a plain dict, so direct assignment works, but session.cookies must be a requests.cookies.RequestsCookieJar, so requests.utils.add_dict_to_cookiejar is used to merge the dict into the jar. Now that we hold a logged-in session, we can visit any page of the site normally:
url='http://xxx.com/Order/CodeByOrder.aspx?OrderCode=xxxxxx'
response = session.get(url)
>>> response
<Response [200]>
Now we can successfully obtain the page source code, and then we can analyze and extract the data we need.
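Before parsing, it helps to confirm the cookies actually worked: expired cookies usually produce a redirect back to the login page rather than an error, so a 200 alone does not prove you are logged in. A hypothetical check (the /Login.aspx path is an assumption; adjust it for your site):

```python
def looks_logged_out(status_code: int, final_url: str,
                     login_path: str = "/Login.aspx") -> bool:
    # Hypothetical heuristic: expired sessions are typically answered
    # with 401/403, or redirected back to the login page.
    if status_code in (401, 403):
        return True
    return login_path.lower() in final_url.lower()

# usage with the session above:
# response = session.get(url)
# if looks_logged_out(response.status_code, response.url):
#     print("cookies expired; log in again and copy fresh cookies")
```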
If you want to send an ad-hoc GET/POST request outside the session, you can also pass the cached cookies back to requests as a parameter (the cookies argument accepts either a dict or a CookieJar), which lets you perform operations that require a login:
cookie_jar = requests.utils.add_dict_to_cookiejar(session.cookies, cookies)  # returns a RequestsCookieJar, not a dict
resget = requests.get(url, headers=headers, cookies=cookie_jar)    # GET
respost = requests.post(url, headers=headers, cookies=cookie_jar)  # POST
print(resget.text)
print(respost.text)
Note: with this cookie-based simulated login, once the copied cookies expire, you have to log in again in the browser and copy out fresh cookies.
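To avoid re-pasting the cookies on every run, you can cache them to disk and reload them at startup. The file name and both helpers below are my own sketch, not part of requests:

```python
import json
import os

import requests


def save_cookies(session: requests.Session, path: str = "cookies.json") -> None:
    # Serialize the session's cookie jar to a plain JSON dict
    with open(path, "w") as f:
        json.dump(requests.utils.dict_from_cookiejar(session.cookies), f)


def load_cookies(session: requests.Session, path: str = "cookies.json") -> bool:
    # Merge cached cookies back into the session; False if there is no cache
    if not os.path.exists(path):
        return False
    with open(path) as f:
        requests.utils.add_dict_to_cookiejar(session.cookies, json.load(f))
    return True
```

At startup, call load_cookies first; only if it returns False (or the site still redirects you to the login page) do you need to copy cookies from the browser again and call save_cookies.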
Complete code
import requests
agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.122 Safari/537.36"
headers = {
"HOST": "xxx.com",
"Referer": "http://xxx.com/Manager/Main.aspx",
"User-Agent": agent
}
cookies = {
'xxx_cookie_time':'2020-04-28xxxx',
'xxx_cookie_language': 'zh_CN',
'ASP.NET_SessionId': 'v0vszqppwpxxxxx',
'ValidCode':'OicQ%2b2xxxx',
'xxx_session_id':'FUl0%2b4kCmyEyxxxxxxxxxx',
'_ati':'1733720xxxx'
}
session = requests.Session()
session.headers = headers
requests.utils.add_dict_to_cookiejar(session.cookies, cookies)
url='http://xxx.com/Order/CodeByOrder.aspx?OrderCode=xxxxxx'
response = session.get(url)
print(response)  # <Response [200]> if the cookies are still valid