The most complete in history! The Python crawler requests library (with examples)

1. Introduction to the requests library

Requests is a simple and elegant HTTP library designed for humans. It sends native HTTP/1.1 requests and is easier to use than the urllib3 library: there is no need to manually add query strings to URLs or to form-encode POST data. Compared with urllib3, requests also provides fully automatic keep-alive and HTTP connection pooling. The requests library includes the following features.

❖ Keep-Alive & connection pooling

❖ Internationalized domain names and URLs

❖ Sessions with persistent cookies

❖ Browser-style SSL verification

❖ Automatic content decoding

❖ Basic/digest authentication

❖ Elegant key/value cookies

❖ Automatic decompression

❖ Unicode response bodies

❖ HTTP(S) proxy support

❖ Multipart file uploads

❖ Streaming downloads

❖ Connection timeouts

❖ Chunked requests

❖ .netrc support

1.1 Installation of Requests

pip install requests

1.2 Basic use of Requests

Code Listing 1-1: Send a get request and view the returned result

import requests
url = 'http://www.tipdm.com/tipdm/index.html'
# Send a GET request
rqg = requests.get(url)
# Check the result type
print('Result type:', type(rqg))
# Check the status code
print('Status code:', rqg.status_code)
# Check the encoding
print('Encoding:', rqg.encoding)
# Check the response headers
print('Response headers:', rqg.headers)
# Print the page content
print('Page content:', rqg.text)
Result type: <class 'requests.models.Response'>
Status code: 200
Encoding: ISO-8859-1
Response headers: {'Date': 'Mon, 18 Nov 2019 04:45:49 GMT', 'Server': 'Apache-Coyote/1.1', 'Accept-Ranges': 'bytes', 'ETag': 'W/"15693-1562553126764"', 'Last-Modified': 'Mon, 08 Jul 2019 02:32:06 GMT', 'Content-Type': 'text/html', 'Content-Length': '15693', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive'}

1.3 Basic request methods in requests

You can send all HTTP requests through the requests library:

requests.get("http://httpbin.org/get") #GET请求
requests.post("http://httpbin.org/post") #POST请求
requests.put("http://httpbin.org/put") #PUT请求
requests.delete("http://httpbin.org/delete") #DELETE请求
requests.head("http://httpbin.org/get") #HEAD请求
requests.options("http://httpbin.org/get") #OPTIONS请求

2. Using requests to send GET requests

One of the most common requests in HTTP is the GET request. Let's take a closer look at how to use requests to construct a GET request.

GET parameter description: get(url, params=None, **kwargs):

❖ url: the URL to be requested

❖ params: (optional) dictionary, list of tuples, or bytes to send in the query string of the request

❖ **kwargs: optional keyword arguments

First, build the simplest GET request. The request URL is http://httpbin.org/get; if the site detects that the client has sent a GET request, it returns the corresponding request information. The following uses requests to build such a GET request:

import requests
r = requests.get('http://httpbin.org/get')
print(r.text)
{
"args": {},
"headers": {
"Accept": "*/*",
"Accept-Encoding": "gzip, deflate",
"Host": "httpbin.org",
"User-Agent": "python-requests/2.24.0",
"X-Amzn-Trace-Id": "Root=1-5fb5b166-571d31047bda880d1ec6c311"
},
"origin": "36.44.144.134",
"url": "http://httpbin.org/get"
}

We can see that we successfully sent a GET request, and that the returned result contains the request headers, URL, IP address, and other information. So, if you want to attach additional information to a GET request, how do you usually do that?

2.1 Send a request with headers

First, we try to request Zhihu’s home page information

import requests
response = requests.get('https://www.zhihu.com/explore')
print(f"The response status code of the current request is: {response.status_code}")
print(response.text)

The response status code of the current request is: 400

400 Bad Request


openresty

Here the response status code is 400, meaning the request failed: Zhihu detected that we are a crawler. We therefore need to disguise ourselves as a browser by adding the corresponding User-Agent information.

import requests
headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
response = requests.get('https://www.zhihu.com/explore', headers=headers)
print(f"The response status code of the current request is: {response.status_code}")
# print(response.text)

The response status code of the current request is: 200

<!doctype html>

.......

Here we added the headers parameter containing the User-Agent field, which is the browser identification information. Our disguise clearly succeeded! Masquerading as a browser in this way is one of the simplest ways to get around anti-crawler measures.

GET parameter description: how to send a request carrying request headers

requests.get(url, headers=headers)

- The headers parameter receives the request headers in the form of a dictionary

- The field name of each header is used as the key, and the corresponding value as the value

Exercise

Request Baidu's homepage https://www.baidu.com with request headers attached, and print the headers of the request that was sent!

Solution

import requests
url = 'https://www.baidu.com'
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
# Carry the User-Agent in the request headers to simulate a browser request
response = requests.get(url, headers=headers)
print(response.content)
# Print the request headers
print(response.request.headers)

2.2 Send a request with parameters

When we search on Baidu, we often notice a '?' in the URL; what follows the question mark is the request parameters, also called the query string!

Usually we do not just visit basic web pages; especially when crawling dynamic pages, we need to pass different parameters to obtain different content. GET has two ways of passing parameters: you can add the parameters directly to the link, or use params to add them.

2.2.1 Carry parameters in url

Initiate a request directly to the url with parameters

import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
url = 'https://www.baidu.com/s?wd=python'
response = requests.get(url, headers=headers)

2.2.2 Carry parameter dictionary through params

1. Build a dictionary of request parameters

2. Bring the parameter dictionary when sending a request to the interface, and set the parameter dictionary to params

import requests
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}
# This is the target url
# url = 'https://www.baidu.com/s?wd=python'
# The result is the same with or without the trailing question mark
url = 'https://www.baidu.com/s?'
# The request parameters are a dictionary, i.e. wd=python
kw = {'wd': 'python'}
# Send the request with the request parameters and get the response
response = requests.get(url, headers=headers, params=kw)
print(response.content)

Judging from the running results, the requested link is automatically constructed as: https://www.baidu.com/s?wd=python.

In addition, a web page sometimes returns a str that is actually in JSON format. If you want to parse such a result directly and get a dictionary, you can call the json() method. Taking http://httpbin.org/get as an example:

import requests
r = requests.get("http://httpbin.org/get")
print(type(r.text))
print(r.json())
print(type(r.json()))

<class 'str'>

{'args': {}, 'headers': {'Accept': '*/*', 'Accept-Encoding': 'gzip, deflate', 'Host': 'httpbin.org', 'User-Agent': 'python-requests/2.24.0', 'X-Amzn-Trace-Id': 'Root=1-5fb5b3f9-13f7c2192936ec541bf97841'}, 'origin': '36.44.144.134', 'url': 'http://httpbin.org/get'}

<class 'dict'>

It can be seen that calling the json() method converts a JSON-formatted string into a dictionary. Note, however, that if the returned result is not in JSON format, a parsing error occurs and a json.decoder.JSONDecodeError exception is raised.
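
As a sketch of how such a failure might be guarded against (the http://httpbin.org/html endpoint is used here only because it returns HTML rather than JSON):

import requests

response = requests.get("http://httpbin.org/html")  # this endpoint returns HTML, not JSON
try:
    data = response.json()
except ValueError:  # json.decoder.JSONDecodeError is a subclass of ValueError
    print("The response body is not valid JSON:")
    print(response.text[:100])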

As supplementary content: a dictionary passed in params is automatically URL-encoded before being sent, as follows:

import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'}
wd = '张三同学'
pn = 1
response = requests.get('https://www.baidu.com/s', params={'wd': wd, 'pn': pn}, headers=headers)
print(response.url)

# The output is: https://www.baidu.com/s?wd=%E5%BC%A0%E4%B8%89%E5%90%8C%E5%AD%A6&pn=1

# It can be seen that the url has been automatically percent-encoded

The above code is equivalent to the code below; the params encoding is essentially done with urlencode.

import requests
from urllib.parse import urlencode
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36'}
wd = '张三同学'
encode_res = urlencode({'k': wd}, encoding='utf-8')
keyword = encode_res.split('=')[1]
print(keyword)
# Then splice the url together
url = 'https://www.baidu.com/s?wd=%s&pn=1' % keyword
response = requests.get(url, headers=headers)
print(response.url)

# The output is: https://www.baidu.com/s?wd=%E5%BC%A0%E4%B8%89%E5%90%8C%E5%AD%A6&pn=1

2.3 Crawl web pages using GET requests

The request above returned a string in JSON format; if you request an ordinary web page instead, you will certainly get the corresponding content!

import requests
import re
headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
response = requests.get('https://www.zhihu.com/explore', headers=headers)
result = re.findall("(ExploreSpecialCard-contentTitle|ExploreRoundtableCard-questionTitle).*?>(.*?)</a>", response.text)
print([i[1] for i in result])

[ ' What's delicious in Xi'an Huimin Street? ' , ' What treasure shops are worth visiting in Xi'an? ' , 'Which commercial districts in Xi'an carry your youth? ' , ' What good driving habits do you have that you can share? ' , ' Are there any driving tips that only experienced drivers know? ' , 'Attention those who have a car, everyone should master these driving knowledge, it can save lives at critical moments' , 'Welcome to Landing! Zhihu Member Recruitment Notice' , 'Planet landing question: Give you ten yuan to travel to the future, how can you make a living? ' , 'Planet landing question: Which kind of "super energy" in Zhihu universe do you most wish to have? How would you use it? ' , 'Norwegian salmon, the origin is important' , 'What are the most attractive places in Norway? ' , ' What is it like to live in Norway? ' , ' How do you view the mass production of BOE's AMOLED flexible screen? What are the future prospects? ' , 'Can flexible screens revolutionize the mobile phone industry? ' , 'What is an ultra-thin bendable flexible battery? Will it have a significant impact on smartphone battery life? ' , ' How can you learn art well and get high marks in the art test with zero foundation in art? ' , 'Is Tsinghua Academy of Fine Arts despised?' , 'Are art students really bad? ' , ' How should a person live this life? ' , 'What should a person pursue in his life? ' , 'Will human beings go crazy after knowing the ultimate truth of the world?' , 'Is anxiety due to lack of ability? ' , 'What kind of experience is social phobia? ' , ' Is the saying "When you're busy you don't have time to be depressed" reasonable? ' ]

Here again we added the headers parameter with the User-Agent field, the browser identification information; if it is not added, Zhihu will block the crawl.

Grabbing binary data

In the above example we grabbed a Zhihu page, which actually returns an HTML document.

What if we want to grab pictures, audio, video, and other files? Such files are essentially binary data; it is only because of their specific storage formats and corresponding parsers that we see them as various kinds of multimedia.

So, if you want to grab them, you need to get their binary code. Let's take GitHub's site icon as an example:

import requests
response = requests.get("https://github.com/favicon.ico")
with open('github.ico', 'wb') as f:
    f.write(response.content)

The Response object has two relevant properties: text and content. The former is text of type str, and the latter is data of type bytes. Audio and video files can be obtained in the same way.
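
For larger media files, a hedged sketch of a streaming download (one of the features listed at the start) could look like this; it reuses the same GitHub icon URL and writes to a hypothetical github_stream.ico file:

import requests

# stream=True defers downloading the body until we iterate over it,
# so large files never have to be held in memory all at once
with requests.get("https://github.com/favicon.ico", stream=True) as response:
    with open("github_stream.ico", "wb") as f:
        for chunk in response.iter_content(chunk_size=1024):
            if chunk:  # skip keep-alive chunks
                f.write(chunk)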

2.4 Carry cookie in Headers parameter

Websites often use the Cookie field in the request header to maintain the user's access status, so we can add Cookie to the headers parameter to simulate the request of ordinary users.

2.4.1 Obtaining Cookies

In order for a crawler to obtain pages that require login, or to get around anti-crawler checks by means of cookies, we need requests to handle cookie-related requests:

import requests
url = 'https://www.baidu.com'
req = requests.get(url)
print(req.cookies)
# Cookies in the response
for key, value in req.cookies.items():
    print(f"{key} = {value}")

<RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>

BDORZ = 27315

Here we successfully obtained the cookies by accessing the cookies attribute, which is of type RequestsCookieJar. We then use the items() method to convert it into a list of tuples, iterate over it, and print the name and value of each cookie.

2.4.2 Login with Cookies

The advantage of carrying cookies and sessions: you can request pages that are only available after login.

The drawback of carrying cookies and sessions: a set of cookies and a session usually corresponds to one user, so requesting too quickly, or too many times, is easily recognized by the server as a crawler.

Avoid using cookies when you do not need them. But to fetch pages behind a login we must send the request with cookies, and we can use cookies directly to maintain the login state. Let's take Zhihu as an example: first log in to Zhihu, then copy the Cookie value from the request headers in your browser.

➢ Copy User-Agent and Cookie from browser

➢ The request header field and value in the browser must be consistent with the headers parameter

➢ The value corresponding to the Cookie key in the headers request parameter dictionary is a string

import requests
import re
# Build the request header dictionary
headers = {
    # User-Agent copied from the browser
    "user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36',
    # Cookie copied from the browser
    "cookie": 'xxx the cookie string copied from the browser goes here'}
# Carry the cookie string in the request header dictionary
response = requests.get('https://www.zhihu.com/creator', headers=headers)
data = re.findall('CreatorHomeAnalyticsDataItem-title.*?>(.*?)</div>', response.text)
print(response.status_code)
print(data)

When we make a request without cookies:

import requests
import re
headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
response = requests.get('https://www.zhihu.com/creator', headers=headers)
data = re.findall('CreatorHomeAnalyticsDataItem-title.*?>(.*?)</div>', response.text)
print(response.status_code)
print(data)

200

[]

The printed list is empty this time. Comparing the two results shows that carrying the cookie in the headers parameter successfully obtains a page that can only be accessed after login!

2.4.3 Use of cookies parameters

In the previous section we carried the cookie in the headers parameter; alternatively, we can use the dedicated cookies parameter.

❖ 1. Format of cookies parameter: dictionary

cookies = "cookie 的 name":"cookie 的 value"

➢ The dictionary corresponds to the Cookie string in the request header, and each pair of dictionary key-value pairs is separated by a semicolon and a space

➢ The left side of the equal sign is the name of a cookie, which corresponds to the key of the cookies dictionary

➢ The right side of the equal sign corresponds to the value of the cookies dictionary

❖ 2. How to use the parameters of cookies

response = requests.get(url, cookies=cookies)

❖ 3. Convert the cookie string to the dictionary needed for the cookies parameter:

cookies_dict = {cookie.split('=')[0]: cookie.split('=')[-1] for cookie in cookies_str.split('; ')}

❖ 4. Note: Cookies generally have an expiration time, and once expired, they need to be obtained again

import requests
import re
url = 'https://www.zhihu.com/creator'
cookies_str = 'the copied cookie string'
headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
cookies_dict = {cookie.split('=', 1)[0]: cookie.split('=', 1)[-1] for cookie in cookies_str.split('; ')}
# Pass the cookies dictionary via the cookies parameter
resp = requests.get(url, headers=headers, cookies=cookies_dict)
data = re.findall('CreatorHomeAnalyticsDataItem-title.*?>(.*?)</div>', resp.text)
print(resp.status_code)
print(data)

200

[ 'In python, how to write this method with different ids but the same class as an integration? ' , ' My parents don't have the money to buy me a computer, what should I do? ' , ' Describe your current living conditions in one sentence? ' ]

2.4.4 Construct the RequestsCookieJar object to set cookies

Here we can also set cookies by constructing the RequestsCookieJar object, the sample code is as follows:

import requests
import re
url = 'https://www.zhihu.com/creator'
cookies_str = 'the copied cookie string'
headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
jar = requests.cookies.RequestsCookieJar()
for cookie in cookies_str.split(';'):
    key, value = cookie.split('=', 1)
    jar.set(key, value)
# Pass the RequestsCookieJar via the cookies parameter
resp = requests.get(url, headers=headers, cookies=jar)
data = re.findall('CreatorHomeAnalyticsDataItem-title.*?>(.*?)</div>', resp.text)
print(resp.status_code)
print(data)

200

[ 'In python, how to write this method with different ids but the same class as an integration? ' , ' My parents don't have the money to buy me a computer, what should I do? ' , ' Describe your current living conditions in one sentence? ' ]

Here we first create a RequestsCookieJar object, use the split() method to break up the copied cookie string, set the key and value of each cookie with the set() method, and finally call requests' get() method and pass the jar in via the cookies parameter.

Of course, because of Zhihu's own restrictions, the headers parameter is still indispensable, but there is no need to set the cookie field inside headers. After testing, this also logs in to Zhihu normally.

2.4.5 Method of converting cookieJar object to cookies dictionary

The response object obtained with requests has a cookies attribute. Its value is of CookieJar type and contains the cookies set locally by the remote server. How do we convert it into a dictionary of cookies?

❖ 1. Conversion method

cookies_dict = requests.utils.dict_from_cookiejar(response.cookies)

❖ 2. The object returned by response.cookies is of CookieJar type

❖ 3. The requests.utils.dict_from_cookiejar function returns a dictionary of cookies

import requests
import re
url = 'https://www.zhihu.com/creator'
cookies_str = 'the copied cookie string'
headers = {"user-agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
cookies_dict = {cookie.split('=', 1)[0]: cookie.split('=', 1)[-1] for cookie in cookies_str.split('; ')}
# Pass the cookies dictionary via the cookies parameter
resp = requests.get(url, headers=headers, cookies=cookies_dict)
data = re.findall('CreatorHomeAnalyticsDataItem-title.*?>(.*?)</div>', resp.text)
print(resp.status_code)
print(data)
# A dictionary can be converted into a requests.cookies.RequestsCookieJar object
cookiejar = requests.utils.cookiejar_from_dict(cookies_dict, cookiejar=None, overwrite=True)
type(cookiejar)     # requests.cookies.RequestsCookieJar
type(resp.cookies)  # requests.cookies.RequestsCookieJar
# The jar built with RequestsCookieJar in the previous section is also of type requests.cookies.RequestsCookieJar
# Convert a CookieJar back into a dictionary
requests.utils.dict_from_cookiejar(cookiejar)

2.5 Timeout settings

In the process of surfing the Internet, we often encounter network fluctuations. At this time, a request that has been waiting for a long time may still have no result.

In a crawler, if a request gets no result for a long time, the efficiency of the whole project drops sharply. In that case we need to force the request to return a result within a specific time, or raise an error otherwise.

❖ 1. How to use the timeout parameter

response = requests.get(url, timeout=3)

❖ 2. timeout=3 means: after sending the request, the response will be returned within 3 seconds, otherwise an exception will be thrown

url = 'http://www.tipdm.com/tipdm/index.html'
# Set the timeout to 2 seconds
print('timeout is 2:', requests.get(url, timeout=2))

If the timeout is too short, an error is raised:

requests.get(url, timeout=0.1)  # note that the timeout here is 0.1

timeout is 2: <Response [200]>
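
A minimal sketch of catching the timeout instead of letting the program crash (the 0.1-second limit is deliberately too short for most networks):

import requests

url = 'http://www.tipdm.com/tipdm/index.html'
try:
    # deliberately use a timeout that is too short
    response = requests.get(url, timeout=0.1)
    print('Response received:', response.status_code)
except requests.exceptions.Timeout:
    print('The request timed out; retry or log the failure here')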

3. Using requests to send POST requests

Thinking: Where do we use POST requests?

1. Login and registration (web engineers consider POST safer than GET, since the user's account, password, and other information are not exposed in the URL)

2. When large amounts of text need to be transmitted (POST requests have no requirement on the length of the data)

By the same token, our crawler needs to simulate a browser and send POST requests in these two situations. Sending a POST request is very similar to the GET method, except that we need to define the parameters in data:

Description of POST parameters:

post(url, data=None, json=None, **kwargs):

❖ url: the URL to be requested

❖ data: (optional) dictionary, list of tuples, bytes, or file-like object to send in the body of the request

❖ json: (optional) JSON-serializable data to send in the body of the request

❖ **kwargs: optional keyword arguments

import requests
payload = {'key1': 'value1', 'key2': 'value2'}
req = requests.post("http://httpbin.org/post", data=payload)
print(req.text)

3.1 POST sends JSON data

Often the data you want to send is not form-encoded; this comes up especially when crawling many Java-based sites. If you pass a string instead of a dict, the data is posted directly as-is. We can use json.dumps() to convert the dict into a str; alternatively, instead of encoding the dict ourselves, we can pass it via the json parameter and it will be encoded automatically.

import json
import requests
url = 'http://httpbin.org/post'
payload = {'some': 'data'}
req1 = requests.post(url, data=json.dumps(payload))
req2 = requests.post(url, json=payload)
print(req1.text)
print(req2.text)

We can see that we successfully obtained the returned result, in which the submitted data is echoed back, proving that the POST request was sent successfully.

Notes

The requests module has three ways of carrying parameters when sending a request: params, data, and json.

params is used in GET requests; data and json are used in POST requests.

The parameters that data can receive are: dictionary, string, bytes, and file object.

❖ With the json parameter, whether the payload is of str type or dict type, if content-type is not specified in headers, it defaults to application/json.

❖ With the data parameter and a dict payload, if content-type is not specified in headers, it defaults to application/x-www-form-urlencoded, which is equivalent to an ordinary HTML form submission; the dictionary is converted into key-value pairs. On the server side the data can be read from request.POST, and the content of request.body is in the key-value form a=1&b=2.

❖ With the data parameter and a str payload, if content-type is not specified in headers, requests does not set a Content-Type automatically and the string is sent as the raw request body.

When data is used to submit the data, the content of request.body has the form a=1&b=2;

when json is used to submit the data, the content of request.body has the form '{"a": 1, "b": 2}'.
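
To see the difference yourself, one illustrative check (not part of the original text) is to post the same dictionary to httpbin with data= and with json= and compare what the server echoes back:

import requests

payload = {'a': 1, 'b': 2}
r_form = requests.post('http://httpbin.org/post', data=payload)  # form-encoded body
r_json = requests.post('http://httpbin.org/post', json=payload)  # JSON body

# httpbin echoes the request back, so we can inspect what was actually sent
print(r_form.json()['headers']['Content-Type'])  # application/x-www-form-urlencoded
print(r_form.json()['form'])                     # {'a': '1', 'b': '2'}
print(r_json.json()['headers']['Content-Type'])  # application/json
print(r_json.json()['json'])                     # {'a': 1, 'b': 2}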

3.2 POST upload file

If we want to use the crawler to upload files, we can use the files parameter:

url = 'http://httpbin.org/post'
files = {'file': open('test.xlsx', 'rb')}
req = requests.post(url, files=files)
req.text

If you are familiar with web development, you will know that when sending a very large file as a multipart/form-data request, you may want to stream the request. requests does not support this by default, but you can use the third-party requests-toolbelt library.
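
A hedged sketch of such a streamed upload with requests-toolbelt's MultipartEncoder (install it with pip install requests-toolbelt; the file name and target URL are placeholders):

import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

# MultipartEncoder reads the file lazily, so it is never loaded into memory in full
m = MultipartEncoder(fields={
    'file': ('test.xlsx', open('test.xlsx', 'rb'), 'application/octet-stream')})
req = requests.post('http://httpbin.org/post', data=m,
                    headers={'Content-Type': m.content_type})
print(req.status_code)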

3.3 Crawl web pages using POST requests

The main task is to find the URL that actually serves the data to be parsed.

import requests
# Prepare the data to translate
kw = input("Enter the word to translate: ")
ps = {"kw": kw}
# Prepare the forged request headers
headers = {
    # User-Agent identifies the client; use a real browser's identity here so that
    # the server believes the request comes from a browser (hiding the crawler)
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.83 Safari/537.36 Edg/85.0.564.41"
}
# Send the POST request, attaching the form data to translate as a dictionary
response = requests.post("https://fanyi.baidu.com/sug", data=ps, headers=headers)
# Print the returned data
# print(response.content)
print(response.content.decode("unicode_escape"))

4. Requests advanced (1): Session maintenance

This part mainly introduces the session maintenance and the use of proxy IP.

In requests, if you use methods like get() or post() directly, you can indeed simulate web page requests, but each call is effectively a separate session; it is as if you used two different browsers to open different pages.

Imagine this scenario: the first request logs in to a website with the post() method, and then, wanting your personal information after the successful login, you request the personal-information page with the get() method. This is in fact like opening two browsers: two completely unrelated sessions. Can the personal information be obtained successfully? Of course not.

 

Some friends may say: wouldn't it be enough to set the same cookies on both requests? Yes, but doing so is cumbersome, and we have an easier solution.

The main way to solve this problem is to maintain the same session, which is like opening a new tab in the same browser instead of opening a new browser. But we do not want to set cookies by hand every time, so what should we do? This is where a new weapon comes in: the Session object.

With it, we can easily maintain a session without having to worry about cookies; it handles them for us automatically.

The Session class in the requests module can automatically process the cookies generated during the process of sending requests and obtaining responses, so as to achieve the purpose of state preservation. Next we come to learn it.

4.1 The role and application scenarios of requests.session

❖ The role of requests.session

Automatically handle cookies, that is, the next request will bring the previous cookie

❖ Application scenarios of requests.session

Automatically handle cookies generated during multiple consecutive requests

4.2 How to use requests.session

After a session instance requests a website, the cookies set locally by that server are saved in the session, and the next time the session is used to request that server, the previous cookies are sent along.

session = requests.session()  # instantiate a Session object

response = session.get(url, headers=headers, ...)

response = session.post(url, data=data, ...)

The parameters of the get or post request sent by the session object are exactly the same as those sent by the requests module.
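
Before the GitHub example in the next subsection, here is a small self-contained sketch against httpbin showing that a Session carries the cookie set by the first request into the second one:

import requests

session = requests.session()
# the first request asks the server to set a cookie
session.get('http://httpbin.org/cookies/set/sessioncookie/123456789')
# the second request automatically carries that cookie back
response = session.get('http://httpbin.org/cookies')
print(response.json())  # {'cookies': {'sessioncookie': '123456789'}}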

4.3 Use Session to maintain github login information

❖ Capture the entire process of github login and access to pages that can only be accessed after login

❖ Determine the url address, request method and required request parameters of the login request

-Some request parameters are in the response content corresponding to other urls, which can be obtained by using the re module

❖ Determine the url address and request method of the page that can only be accessed after login

❖ Code completion using requests.session

import requests
import re
# Build the request header dictionary
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}
# Instantiate the session object
session = requests.session()
# Visit the login page to get the parameters required by the login request
response = session.get('https://github.com/login', headers=headers)
# Use a regex to extract the parameter required by the login request
authenticity_token = re.search('name="authenticity_token" value="(.*?)" />', response.text).group(1)
# Build the login request parameter dictionary
data = {
    'commit': 'Sign in',  # fixed value
    'utf8': ' ',  # fixed value
    'authenticity_token': authenticity_token,  # this parameter comes from the login page's response
    'login': input('Enter your GitHub username: '),
    'password': input('Enter your GitHub password: ')}
# Send the login request (the response to this request does not need to be inspected)
session.post('https://github.com/session', headers=headers, data=data)
# Print a page that can only be accessed after logging in
response = session.get('https://github.com/settings/profile', headers=headers)
print(response.text)

You can use a text comparison tool to verify the result!

5. Requests advanced (2): Using proxies

For some websites, the content can be obtained normally after requesting several times during the test. But once large-scale crawling starts, for large-scale and frequent requests, the website may pop up a verification code, or jump to the login authentication page, or even directly block the client's IP, resulting in inaccessibility for a certain period of time.

To prevent this from happening, we need to set up a proxy to solve the problem, which requires the proxies parameter. It can be set as follows.

The proxies parameter specifies a proxy IP so that the request we send is forwarded to the target through the corresponding proxy server; so let's first understand proxy IPs and proxy servers.

5.1 The process of using the proxy

1. Proxy ip is an ip pointing to a proxy server

2. The proxy server can help us forward the request to the target server

5.2 Forward proxy and reverse proxy

As mentioned earlier, the proxy IP specified by the proxies parameter points to a forward proxy server, and correspondingly there are also reverse proxy servers; let's now look at the difference between forward and reverse proxy servers.

❖ Distinguish between forward or reverse proxy from the perspective of the party sending the request

❖ Forwarding the request for the browser or the client (the party that sends the request) is called a forward proxy

- the browser knows the real ip address of the server that ultimately handles the request, such as a VPN

❖ A proxy that forwards requests not on behalf of the browser or client (the party sending the request) but on behalf of the server that ultimately handles the request is called a reverse proxy

- The browser does not know the real address of the server, such as nginx.

5.3 Classification of proxy ip (proxy server)

❖ According to the degree of anonymity of proxy IP, proxy IP can be divided into the following three categories:

➢ Transparent proxy: although a transparent proxy can "hide" your IP address, the target server can still find out who you are.

The request headers received by the target server are as follows:

REMOTE_ADDR = Proxy IP

HTTP_VIA = Proxy IP

HTTP_X_FORWARDED_FOR = Your IP

➢ Anonymous Proxy (Anonymous Proxy): With anonymous proxy, others can only know that you use a proxy, but cannot know who you are.

The request headers received by the target server are as follows:

REMOTE_ADDR = proxy IP

HTTP_VIA = proxy IP

HTTP_X_FORWARDED_FOR = proxy IP

➢ High anonymity proxy (elite proxy): a high anonymity proxy makes it impossible for others to detect that you are using a proxy, so it is the best choice. Without a doubt, a high anonymity proxy works best.

The request headers received by the target server are as follows:

REMOTE_ADDR = Proxy IP

HTTP_VIA = not determined

HTTP_X_FORWARDED_FOR = not determined

❖ Depending on the protocol used by the target website, a proxy service for the corresponding protocol is required.

Based on the protocol a proxy uses to serve requests, proxies can be categorized as:

➢ http proxy: target url is http protocol

➢ https proxy: the target url is https protocol

➢ socks tunnel proxy (such as socks5 proxy), etc.:

✾ 1. The socks proxy simply transmits data packets, regardless of the application protocol (FTP, HTTP, HTTPS, etc.).

✾ 2. The socks proxy takes less time than the http and https proxies.

✾ 3. A socks proxy can forward both http and https requests (a usage sketch follows this list)
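
To route requests through a SOCKS proxy you need the optional dependency installed with pip install requests[socks]; a hedged sketch with a placeholder proxy address:

import requests

# 127.0.0.1:1080 is only a placeholder; replace it with a reachable SOCKS5 proxy
proxies = {
    'http': 'socks5://127.0.0.1:1080',
    'https': 'socks5://127.0.0.1:1080',
}
response = requests.get('http://httpbin.org/ip', proxies=proxies, timeout=5)
print(response.text)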

5.4 Use of proxies proxy parameters

To make the server think that the requests are not coming from the same client, and to prevent our IP from being blocked for sending frequent requests to one domain, we need to use proxy IPs. Next, let's learn the basic usage of proxy IPs with the requests module.

response = requests.get(url, proxies=proxies)
# proxies takes the form of a dictionary:
proxies = {
    "http": "http://12.34.56.79:9527",
    "https": "https://12.34.56.79:9527",
}

Note: If the proxies dictionary contains multiple key-value pairs, the corresponding proxy ip will be selected according to the protocol of the url address when sending the request

import requests
proxies = {
    "http": "http://124.236.111.11:80",
    "https": "https://183.220.145.3:8080"}
req = requests.get('http://www.baidu.com', proxies=proxies)
req.status_code

6. Requests advanced (3): SSL certificate verification

In addition, requests provides certificate verification. When sending an HTTPS request, it checks the SSL certificate, and we can use the verify parameter to control whether the check is performed. If the verify parameter is not given, the default is True, and the certificate is verified automatically.

Now let's test it with requests:

import requests
url = 'https://cas.xijing.edu.cn/xjtyrz/login'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
req = requests.get(url, headers=headers)

SSLError: HTTPSConnectionPool(host='cas.xijing.edu.cn', port=443): Max retries exceeded with url: /xjtyrz/login (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1123)')))

An SSLError is raised here, indicating a certificate verification error. So if an HTTPS site is requested but its certificate fails verification, this error is reported. How can we avoid it? Very simple: set the verify parameter to False.

The relevant code is as follows:

import requests
url = 'https://www.jci.edu.cn/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
req = requests.get(url, headers=headers, verify=False)
req.status_code

200

I can't find any more web pages that require SSL verification, how annoying!

However, we found that a warning was reported, suggesting that we specify a certificate. We can suppress this warning by ignoring warnings:

import requests
from requests.packages import urllib3
urllib3.disable_warnings()
url = 'https://www.jci.edu.cn/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
req = requests.get(url, headers=headers, verify=False)
req.status_code

200

Or ignore warnings by capturing them to the log:

import logging
import requests
logging.captureWarnings(True)
url = 'https://www.jci.edu.cn/'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36'}
req = requests.get(url, headers=headers, verify=False)
req.status_code

200

Of course, we can also specify a local certificate to be used as the client certificate, this can be a single file (containing key and certificate) or a tuple containing two file paths:

import requests
response = requests.get('https://www.12306.cn', cert=('/path/server.crt', '/path/key'))
print(response.status_code)

200

Of course, the code above is only a demonstration; we need to have the crt and key files and specify their paths. Note that the private key of the local certificate must be decrypted; a key in encrypted form is not supported. There are very few sites like this nowadays!
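
Relatedly, instead of switching verification off, the verify parameter can also be given the path of a trusted CA bundle; a brief sketch with a placeholder path:

import requests

# verify can also take the path to a trusted CA bundle; the path below is a placeholder
response = requests.get('https://www.12306.cn', verify='/path/to/ca-bundle.pem')
print(response.status_code)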

7. Other contents of the Requests library

7.1 View the response content

After sending the request, the response is naturally obtained. In the above example, we used text and content to get the content of the response. In addition, there are many properties and methods that can be used to obtain other information, such as status codes, response headers, cookies, etc.

Examples are as follows:

import requests
url = 'https://www.baidu.com'
req = requests.get(url)
print(req.status_code)  # response status code
print(req.text)         # text content of the response
print(req.content)      # binary content of the response
print(req.cookies)      # cookies of the response
print(req.encoding)     # encoding of the response
print(req.headers)      # headers of the response
print(req.url)          # URL of the response
print(req.history)      # history of the response (redirects)

7.2 View status code and encoding

Use rqg.status_code to view the status code returned by the server, and rqg.encoding to view the page encoding that requests inferred from the HTTP headers returned by the server. Note that when the requests library guesses the encoding wrong, you need to specify it manually to avoid garbled characters in the returned page content.

7.3 Send a get request and manually specify the encoding

Code 1-2: Send a get request and manually specify the encoding

import requests
url = 'http://www.tipdm.com/tipdm/index.html'
rqg = requests.get(url)
print('Status code ', rqg.status_code)
print('Encoding ', rqg.encoding)
rqg.encoding = 'utf-8'  # manually specify the encoding
print('Modified encoding ', rqg.encoding)
# print(rqg.text)

Status code  200
Encoding  ISO-8859-1
Modified encoding  utf-8

Notes

Specifying the encoding manually is inflexible and cannot adapt to the different page encodings encountered while crawling, whereas using the chardet library is simple and flexible. chardet is an excellent string/file encoding detection module.

7.4 Use of chardet library

The detect method of the chardet library can detect the encoding of a given byte string. Its syntax is as follows.

chardet.detect(byte_str)

Common parameter of the detect method and its description:

byte_str: receives bytes; the byte string whose encoding is to be detected. No default.

7.5 Use the detect method to detect encoding and specify

Code 1-3: Use the detect method to detect encoding and specify the encoding

import chardet
import requests
url = 'http://www.tipdm.com/tipdm/index.html'
rqg = requests.get(url)
print(rqg.encoding)
print(chardet.detect(rqg.content))
# Access the 'encoding' element of the detection result and use it to correct the encoding
rqg.encoding = chardet.detect(rqg.content)['encoding']
print(rqg.encoding)

ISO-8859-1

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

utf-8

7.6 Requests Library Comprehensive Test

Send a complete GET request to the site 'http://www.tipdm.com/tipdm/index.html' that includes the link, request headers, response headers, timeout, and status code, with the encoding set correctly.

Listing 1-6: Generate a complete HTTP request.

# Import the relevant libraries
import requests
import chardet
# Set the url
url = 'http://www.tipdm.com/tipdm/index.html'
# Set the request headers
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.97 Safari/537.36"}
# Send the GET request with a timeout of 2 seconds
rqg = requests.get(url, headers=headers, timeout=2)
# Check the status code
print("Status code ", rqg.status_code)
# Check the encoding
print('Encoding ', rqg.encoding)
# Correct the encoding using the detect method of the chardet library
rqg.encoding = chardet.detect(rqg.content)['encoding']
# Check the corrected encoding
print('Corrected encoding: ', rqg.encoding)
# Check the response headers
print('Response headers:', rqg.headers)
# Check the page content
# print(rqg.text)

Status code  200
Encoding  ISO-8859-1
Corrected encoding:  utf-8
Response headers: {'Date': 'Mon, 18 Nov 2019 06:28:56 GMT', 'Server': 'Apache-Coyote/1.1', 'Accept-Ranges': 'bytes', 'ETag': 'W/"15693-1562553126764"', 'Last-Modified': 'Mon, 08 Jul 2019 02:32:06 GMT', 'Content-Type': 'text/html', 'Content-Length': '15693', 'Keep-Alive': 'timeout=5, max=100', 'Connection': 'Keep-Alive'}
