When we visit a web page, we usually enter data through an input box; the page then sends a GET, POST, or other type of request to the server, and on success the returned data is displayed on the front end. The following is a brief introduction to the requests library for Python.
Prerequisite: install Python and the requests library first.
Install requests:
pip install requests
The test URL used in the requests below:
import requests
url = 'http://www.test.com'
One, GET request
1. No request parameters: access the URL directly to get data
result = requests.get(url=url)
print(result.status_code)  # response status code
print(result.url)  # requested URL
print(result.text)  # response body
2. With request parameters: parameters are passed as key-value pairs
result = requests.get(url=url, params={'keyword1': 'val1', 'keyword2': 'val2'})
# Alternatively, build the URL by hand first
# new_url = url + '?keyword1=' + 'val1' + '&keyword2=' + 'val2'
# result = requests.get(url=new_url)
print(result.status_code)  # response status code
print(result.url)  # requested URL
print(result.text)  # response body
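Building the query string by hand breaks as soon as a value contains a space or an ampersand. As a minimal sketch of the encoding step, the standard library's urllib.parse.urlencode can be used directly; the helper name build_url and the sample values are assumptions made for illustration only.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_url(base_url, params):
    """Append params to base_url as a percent-encoded query string."""
    # Use '&' if the URL already carries a query string, otherwise '?'.
    separator = '&' if urlsplit(base_url).query else '?'
    return base_url + separator + urlencode(params)


# Illustrative values containing characters that must be escaped.
new_url = build_url('http://www.test.com', {'keyword1': 'val 1', 'keyword2': 'a&b'})
print(new_url)  # http://www.test.com?keyword1=val+1&keyword2=a%26b

# Decoding the query string gives back the original values.
print(parse_qs(urlsplit(new_url).query))
```

Passing a dict through params lets the library handle this escaping, which is why it is usually the safer choice over string concatenation.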
3. With request headers: headers are passed as key-value pairs
header = {
    'Host': 'test.com',
    'Content-Type': 'application/json; charset=UTF-8',
}
# the keyword argument is headers (plural)
result = requests.get(url=url, headers=header)
print(result.status_code)  # response status code
print(result.url)  # requested URL
print(result.text)  # response body
Two, POST request
1. The request body is application/x-www-form-urlencoded
result = requests.post(url=url,
                       data={'keyword1': 'val1', 'keyword2': 'val2'},
                       headers={'Content-Type': 'application/x-www-form-urlencoded'})
print(result.status_code)  # response status code
print(result.url)  # requested URL
print(result.text)  # response body
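2. The request body is multipart/form-data
# Pass the fields through files= as (None, value) tuples so that requests builds
# the multipart body and sets the Content-Type header, including its boundary.
# Setting 'Content-Type': 'multipart/form-data' by hand omits the boundary.
result = requests.post(url=url, files={
    'keyword1': (None, 'val1'), 'keyword2': (None, 'val2')})
print(result.status_code)  # response status code
print(result.url)  # requested URL
print(result.text)  # response body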
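A multipart/form-data body is split into parts by a boundary string, and the same boundary must appear in the Content-Type header, so the header is normally generated together with the body rather than typed by hand. The following is a rough sketch of that wire format for plain text fields only; the function name encode_multipart is hypothetical and this is not how any particular library implements it.

```python
import uuid


def encode_multipart(fields):
    """Encode a dict of text fields as a multipart/form-data body.

    Returns (body_bytes, content_type_header_value).
    """
    boundary = uuid.uuid4().hex  # random string that separates the parts
    lines = []
    for name, value in fields.items():
        lines.append('--' + boundary)
        lines.append('Content-Disposition: form-data; name="{}"'.format(name))
        lines.append('')  # blank line between part headers and part body
        lines.append(value)
    lines.append('--' + boundary + '--')  # closing delimiter
    lines.append('')
    body = '\r\n'.join(lines).encode('utf-8')
    content_type = 'multipart/form-data; boundary=' + boundary
    return body, content_type


body, content_type = encode_multipart({'keyword1': 'val1', 'keyword2': 'val2'})
print(content_type)
print(body.decode('utf-8'))
```

A server that receives the header without the boundary parameter has no way to split the body into fields.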
3. The request body is application/json
import json
data = {
    'keyword1': 'val1', 'keyword2': 'val2'}
json_data = json.dumps(data)  # serialize the dict to a JSON string
result = requests.post(url=url, data=json_data,
                       headers={'Content-Type': 'application/json'})
print(result.status_code)  # response status code
print(result.url)  # requested URL
print(result.text)  # response body
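To make the difference between the form-encoded and JSON cases concrete, here is a small sketch that serializes the same dict both ways using only the standard library. Nothing is sent over the network; it only shows what the body text looks like in each case.

```python
import json
from urllib.parse import urlencode

data = {'keyword1': 'val1', 'keyword2': 'val2'}

# Body for Content-Type: application/x-www-form-urlencoded
form_body = urlencode(data)
print(form_body)  # keyword1=val1&keyword2=val2

# Body for Content-Type: application/json
json_body = json.dumps(data)
print(json_body)  # {"keyword1": "val1", "keyword2": "val2"}
```

The server decides how to parse the body from the Content-Type header, so the header and the serialization have to match.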
As shown in the figure below:
Here is a pitfall I ran into earlier: the request method was POST, and the body was sent as a Request Payload.
At first I thought it was the same as form-data, so I passed only the url and the data without serializing the data to JSON. The result was status 415 (Unsupported Media Type): the server could not process the media format attached to the request. After some research I changed the data format and the request header, and the data came back without problems.
The complete demo is as follows:
import json
import requests
url = 'http://test.com/products'
# payload data
payload_data = {
    'keyword1': "val1", 'keyword2': "val2"}
# request headers
payload_header = {
    'Host': 'test.com',
    'Content-Type': 'application/json; charset=UTF-8',
}
# download timeout in seconds
timeout = 30
# proxy IPs: a dict holds one proxy per scheme, and requests expects lowercase keys
# proxy_list = {"http": 'http://210.22.5.117:3128'}
# other candidates: 'http://163.172.189.32:8811', 'http://180.153.144.138:8800'
json_data = json.dumps(payload_data)
# allow_redirects controls whether redirects are followed
# result = requests.post(url=url, data=json_data, headers=payload_header, timeout=timeout, proxies=proxy_list, allow_redirects=True)
result = requests.post(url, data=json_data, headers=payload_header, timeout=timeout, allow_redirects=True)
# Passing the dict through the json parameter also works; requests serializes it itself
# result = requests.post(url, json=payload_data, headers=payload_header)
print("elapsed: {0}, status code: {1}, result: {2}".format(result.elapsed, result.status_code, result.text))
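The 415 pitfall described above can be reproduced without a real site. The sketch below starts a throwaway local server that, by assumption, accepts only application/json bodies, then posts to it twice with the standard library. The handler class, the helper post, and the /products path on localhost are all stand-ins invented for this demo.

```python
import json
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class JsonOnlyHandler(BaseHTTPRequestHandler):
    """Hypothetical endpoint that only accepts application/json bodies."""

    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        body = self.rfile.read(length)
        content_type = self.headers.get('Content-Type', '')
        if not content_type.startswith('application/json'):
            self.send_response(415)  # Unsupported Media Type
            self.end_headers()
            return
        self.send_response(200)
        self.send_header('Content-Type', 'application/json')
        self.end_headers()
        self.wfile.write(body)  # echo the JSON back

    def log_message(self, *args):
        pass  # keep the demo output quiet


# Opener that ignores any system proxy settings for the local demo.
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))


def post(url, body, content_type):
    """POST body (bytes) and return the HTTP status code."""
    request = urllib.request.Request(url, data=body,
                                     headers={'Content-Type': content_type})
    try:
        with opener.open(request, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


server = HTTPServer(('127.0.0.1', 0), JsonOnlyHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
demo_url = 'http://127.0.0.1:{}/products'.format(server.server_port)

payload = {'keyword1': 'val1', 'keyword2': 'val2'}
form_status = post(demo_url, b'keyword1=val1&keyword2=val2',
                   'application/x-www-form-urlencoded')
json_status = post(demo_url, json.dumps(payload).encode('utf-8'),
                   'application/json; charset=UTF-8')
print(form_status, json_status)  # 415 200

server.shutdown()
server.server_close()
```

The only difference between the two calls is the body serialization and the Content-Type header, which matches the fix described in the text.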
Three, simulating login before sending a POST request
Sometimes you want to automate small operations on a page; for example, after logging in, you use an ajax request to modify data. If only a handful of records need changing, doing it by hand in the browser is faster. For large-scale modifications, it is better to loop over the records with a program.
Login page:
First press F12 to open the developer tools, then enter any data in the login form and click login. The login data can be wrong, since we only want to see the format of the data submitted by the login request, as shown below:
Some of the submitted values are hidden fields that we did not enter. We need to get them from the form in the page source: right-click to view the page source and search for "__VIEWSTATE", "__VIEWSTATEGENERATOR", and "__EVENTVALIDATION", for example:
In other words, we have to fetch the page source in advance and parse it to obtain these values:
import requests, string
import re, urllib.request, lxml.html, http.cookiejar
login_url = "http://test.com/Login.aspx"
response = urllib.request.urlopen(login_url)
f = response.read()
doc = lxml.html.document_fromstring(f)
VIEWSTATE = doc.xpath("//input[@id='__VIEWSTATE']/@value")
VIEWSTATEGENERATOR = doc.xpath("//input[@id='__VIEWSTATEGENERATOR']/@value")
EVENTVALIDATION = doc.xpath("//input[@id='__EVENTVALIDATION']/@value")
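If lxml is not available, the same extraction can be sketched with the standard library's html.parser instead. This is a swapped-in technique, not the approach used above; the class name HiddenInputParser and the field values in the sample HTML are placeholders made up for illustration.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of all <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != 'input':
            return
        attributes = dict(attrs)
        if attributes.get('type') == 'hidden' and 'name' in attributes:
            self.fields[attributes['name']] = attributes.get('value') or ''


# Placeholder page source; a real login page would be downloaded first.
sample_html = '''
<form method="post" action="./Login.aspx">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" id="__VIEWSTATEGENERATOR" value="C2EE9ABB" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="AbCd" />
  <input type="text" name="TextAdminName" />
</form>
'''

parser = HiddenInputParser()
parser.feed(sample_html)
hidden_fields = parser.fields
print(hidden_fields)
```

Collecting every hidden input into one dict has the side benefit that extra hidden fields added to the form later are picked up without code changes.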
After getting these values, put them back into the form data that will be submitted:
from urllib.parse import urlencode
login_data = urlencode({
    '__EVENTTARGET': '',
    '__EVENTARGUMENT': '',
    '__VIEWSTATE': VIEWSTATE[0],
    '__VIEWSTATEGENERATOR': VIEWSTATEGENERATOR[0],
    '__EVENTVALIDATION': EVENTVALIDATION[0],
    'TextCustomerID': "real merchant ID",
    'TextAdminName': 'real user name',
    'TextPassword': 'real password',
    'btnLogin.x': 40,
    'btnLogin.y': 10
}).encode('utf-8')  # the request body must be bytes, not str
The encoding of the login parameters matters: urlencode returns a str, while the request body must be bytes. If the result is not encoded with utf-8, the following error is reported:
Traceback (most recent call last):
File "c:\users\user\appdata\local\programs\python\python38\lib\http\client.py", line 965, in send
self.sock.sendall(data)
File "c:\users\user\appdata\local\programs\python\python38\lib\ssl.py", line 1201, in sendall
with memoryview(data) as view, view.cast("B") as byte_view:
TypeError: memoryview: a bytes-like object is required, not 'str'
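The str-versus-bytes distinction behind this error can be checked without sending anything. A minimal sketch, with made-up field values:

```python
from urllib.parse import urlencode, parse_qs

form = {'TextAdminName': 'user', 'TextPassword': 'p@ss word'}

encoded_text = urlencode(form)                # a str, not yet usable as a body
encoded_bytes = encoded_text.encode('utf-8')  # bytes, safe to send as the body

print(type(encoded_text).__name__, type(encoded_bytes).__name__)  # str bytes
print(encoded_bytes)  # b'TextAdminName=user&TextPassword=p%40ss+word'

# Decoding on the receiving side restores the original values.
print(parse_qs(encoded_bytes.decode('utf-8')))
```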
With the form data ready, the next step is to build the request headers:
header = {
'Host': 'www.test.com',
'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
'Accept-Encoding': 'gzip, deflate',
'Content-Type': 'application/x-www-form-urlencoded',
'Origin': 'http://www.test.com',
'Connection': 'keep-alive',
'Referer': 'http://www.test.com/Login.aspx',
'Upgrade-Insecure-Requests': '1'
}
Simulate the login and save the cookies:
# simulate the login request
login_request = urllib.request.Request(login_url, login_data, header)
# create a cookie jar so that the login persists across requests
cj = http.cookiejar.CookieJar()
opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
login_result = opener.open(login_request)
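To see what the cookie jar buys you, the sketch below runs the same opener pattern against a throwaway local server. The server is a stand-in written for this demo: it is assumed to set a session cookie on any POST and to serve /data only when that cookie is sent back. The handler class, the helper status_of, the paths, and the credentials are all hypothetical.

```python
import http.cookiejar
import threading
import urllib.error
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import urlencode


class LoginDemoHandler(BaseHTTPRequestHandler):
    """Hypothetical site: any POST sets a cookie, GET requires that cookie."""

    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        self.rfile.read(length)  # the demo accepts any credentials
        self.send_response(200)
        self.send_header('Set-Cookie', 'session=demo-token; Path=/')
        self.end_headers()
        self.wfile.write(b'logged in')

    def do_GET(self):
        if 'session=demo-token' in self.headers.get('Cookie', ''):
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'private data')
        else:
            self.send_response(403)
            self.end_headers()

    def log_message(self, *args):
        pass  # keep the demo output quiet


def status_of(opener, url, data=None):
    """Open url with the given opener and return the HTTP status code."""
    try:
        with opener.open(url, data=data, timeout=5) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code


server = HTTPServer(('127.0.0.1', 0), LoginDemoHandler)  # port 0: pick a free port
threading.Thread(target=server.serve_forever, daemon=True).start()
base_url = 'http://127.0.0.1:{}'.format(server.server_port)

# ProxyHandler({}) makes both openers ignore system proxy settings.
anonymous = urllib.request.build_opener(urllib.request.ProxyHandler({}))
cookie_jar = http.cookiejar.CookieJar()
logged_in = urllib.request.build_opener(
    urllib.request.ProxyHandler({}),
    urllib.request.HTTPCookieProcessor(cookie_jar))

form = urlencode({'TextAdminName': 'user', 'TextPassword': 'secret'}).encode('utf-8')
before = status_of(anonymous, base_url + '/data')                  # no cookie
login = status_of(logged_in, base_url + '/Login.aspx', data=form)  # stores cookie
after = status_of(logged_in, base_url + '/data')                   # cookie sent
print(before, login, after)  # 403 200 200
print([cookie.name for cookie in cookie_jar])  # ['session']

server.shutdown()
server.server_close()
```

Only requests made through the opener that owns the cookie jar carry the session cookie, so later page fetches need to go through that same opener.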
After the simulated login, if you want to collect data from a page, open the page link through opener.open (which sends the saved cookies; a plain urllib.request.urlopen does not) and read the page source. If there is a batch of data to process with POST or GET, fetch the data first, then loop over it and send one request per record:
import time, random
datas = [...]  # the batch of records to process (left elided, as in the original)
for data in datas:
    response = requests.get(url, headers=header, params=data, cookies=cj)
    # or
    # response = requests.post(url, headers=header, data=data, cookies=cj)
    time.sleep(random.randint(3, 5))  # wait 3 to 5 seconds between requests
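The loop above can be wrapped in a small helper so that the sending function and the delay are easy to swap or test. This is only a sketch: the name process_batch and its parameters are hypothetical, and the usage below replaces the real request and the real sleep with stand-ins so it runs instantly.

```python
import random
import time


def process_batch(records, send, min_delay=3, max_delay=5, sleep=time.sleep):
    """Call send(record) for each record, pausing a random whole number of
    seconds between consecutive requests (but not after the last one)."""
    results = []
    for index, record in enumerate(records):
        results.append(send(record))
        if index < len(records) - 1:
            sleep(random.randint(min_delay, max_delay))
    return results


# Usage with stand-ins: a fake sender that always reports 200, and a fake
# sleep that records the chosen delays instead of waiting.
delays = []
records = [{'id': 1}, {'id': 2}, {'id': 3}]
statuses = process_batch(records, send=lambda record: 200, sleep=delays.append)
print(statuses, len(delays))  # [200, 200, 200] 2
```

In real use, send would be a function that issues the GET or POST for one record and returns whatever you want to keep from the response.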