Follow An Xian to Learn Python Web Crawlers: Using the requests Module (1)

This stage of the course focuses on the requests module, an HTTP library used to send requests and receive responses. There are alternatives, such as the urllib module, but requests is the most widely used in practice. Its code is simple and easy to understand; compared with the bloated urllib module, a crawler written with requests needs less code and implements the same functionality more easily. You are therefore encouraged to master this module.

Table of Contents

requests module

1. Introduction to the requests module

1.1 The role of the requests module:

1.2 The requests module is a third-party module and must be installed into your Python (virtual) environment

1.3 The requests module sends a get request

2. response object

2.1 The difference between response.text and response.content:

2.2 Solve Chinese garbled characters by decoding response.content

2.3 Other common attributes or methods of response object

3. The requests module sends a request

3.1 Send a request with header

3.2 Sending a request with parameters

3.3 Carrying cookies in the headers parameter

3.4 Use of cookies parameters

3.5 Method of converting cookieJar object into cookies dictionary


requests module

Knowledge points:

  • Master the use of the headers parameter

  • Master sending requests with parameters

  • Master carrying cookies in headers

  • Master the use of the cookies parameter

  • Master the method of converting a cookieJar

  • Master the use of the timeout parameter

  • Master the use of the proxies parameter for proxy IPs

  • Master the use of the verify parameter to ignore CA certificates

  • Master sending POST requests with the requests module

  • Master the use of requests.session for state maintenance


Now that we have covered crawler basics, let's learn how to implement a crawler in code.

 

1. Introduction to the requests module

requests documentation: http://docs.python-requests.org/zh_CN/latest/index.html

1.1 The role of the requests module:

  • Send http request and get response data

1.2 The requests module is a third-party module and must be installed into your Python (virtual) environment

  • pip/pip3 install requests

1.3 The requests module sends a get request

  1. Requirement: use requests to send a request to the Baidu homepage and get the page's source code

  2. Run the following code and observe the printed output

# 1.2.1 - minimal implementation
import requests

# target url
url = 'https://www.baidu.com'

# send a GET request to the target url
response = requests.get(url)

# print the response body
print(response.text)

 

Knowledge point: master sending GET requests with the requests module


 

2. response object

Observe the result of running the code above: there are many garbled characters. This happens when the response is decoded with the wrong character set. Let's try the following approach to fix the garbled Chinese text.

# 1.2.2 - response.content
import requests

# target url
url = 'https://www.baidu.com'

# send a GET request to the target url
response = requests.get(url)

# print the response body
# print(response.text)
print(response.content.decode())  # note this line!

  1. response.text is the result of the requests module decoding the response body with the character set it inferred (via the chardet module)

  2. Strings transmitted over the network are of type bytes, so response.text = response.content.decode('inferred character set')

  3. We can search the web page source code for charset to find a candidate character set, but note that it may be inaccurate
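A minimal offline sketch of the decode behavior described above, using a made-up byte string instead of a real response:

```python
# Simulate the bytes a server might send, assuming the page is encoded as UTF-8
content = '百度一下,你就知道'.encode('utf-8')

# Decoding with the correct character set yields readable text
print(content.decode('utf-8'))

# Decoding with the wrong character set yields mojibake rather than an error here,
# because iso-8859-1 maps every byte to some character
print(content.decode('iso-8859-1'))
```

This is exactly why response.text can look garbled: the inferred character set was wrong, while response.content lets us pick the right one ourselves.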

2.1 The difference between response.text and response.content:

  • response.text

    • Type: str

    • Decoding type: the requests module makes an educated guess at the response's encoding based on the HTTP headers, then decodes the text with that inferred encoding

  • response.content

    • Type: bytes

    • Decoding type: not specified


Knowledge point: master the difference between response.text and response.content


 

2.2 Solve Chinese garbled characters by decoding response.content

  • response.content.decode() defaults to utf-8

  • response.content.decode("GBK")

  • Common coded character set

    • utf-8

    • gbk

    • gb2312

    • ascii

    • iso-8859-1
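When the page does not say which character set it uses, one pragmatic approach is to try the common ones in turn. The helper below (try_decode is a hypothetical name introduced here, not part of the requests API) sketches that idea:

```python
def try_decode(content: bytes) -> str:
    """Try the common character sets in turn and return the first successful decode.
    try_decode is a hypothetical helper, not part of the requests API."""
    for charset in ('utf-8', 'gbk', 'gb2312', 'ascii'):
        try:
            return content.decode(charset)
        except UnicodeDecodeError:
            continue
    # iso-8859-1 never raises, so keep it as a last resort
    return content.decode('iso-8859-1')

# bytes encoded as gbk fail utf-8 decoding, then succeed on gbk
print(try_decode('中文内容'.encode('gbk')))  # 中文内容
```

The order matters: iso-8859-1 accepts any byte sequence, so it must come last or it would mask the correct encoding.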


Knowledge point: master using the decode function to fix garbled Chinese text in response.content


 

2.3 Other common attributes or methods of response object

response = requests.get(url) returns a response object. Besides text and content, the response object has other commonly used attributes and methods:

  • response.url The URL of the response; sometimes it differs from the URL of the request

  • response.status_code The response status code

  • response.request.headers The request headers of the request that produced this response

  • response.headers The response headers

  • response.request._cookies The cookies carried by the corresponding request; returns a cookieJar object

  • response.cookies The cookies set by the response (via Set-Cookie); returns a cookieJar object

  • response.json() Automatically converts a JSON-string response body into a Python object (dict or list)

# 1.2.3 - other common response attributes
import requests

# target url
url = 'https://www.baidu.com'

# send a GET request to the target url
response = requests.get(url)

# print the response body
# print(response.text)
# print(response.content.decode()) 			# note this line!
print(response.url)							# print the response's url
print(response.status_code)					# print the response status code
print(response.request.headers)				# print the request headers
print(response.headers)						# print the response headers
print(response.request._cookies)			# print the cookies carried by the request
print(response.cookies)						# print the cookies carried in the response

 

Knowledge point: master other common attributes of response object


 

3. The requests module sends a request

3.1 Send a request with header

Let's write some code to fetch the Baidu homepage

import requests

url = 'https://www.baidu.com'

response = requests.get(url)

print(response.content.decode())

# print the request headers of the request that produced this response
print(response.request.headers)

3.1.1 Thinking

  1. What is the difference between the source code of the Baidu homepage on the browser and the source code of the Baidu homepage in the code?

    • How to view the source code of a webpage:

      • Right click → View page source, or

      • Right click → Inspect

  2. What is the difference between the response content of the corresponding URL (seen in the browser) and the Baidu homepage source code fetched in code?

    • How to view the response content of the corresponding URL:

      1. Right click → Inspect

      2. Click the Network tab

      3. Check Preserve log

      4. Refresh the page

      5. Under Name, find the entry whose URL matches the browser address bar, then view its Response

  3. Why is the Baidu homepage source code fetched by our code so much smaller?

    • We need to send the request header information along with the request

      Recall the concept of a crawler: simulate the browser and deceive the server to obtain the same content as the browser

    • There are many fields in the request header; among them the User-Agent field is essential, as it indicates the client's operating system and browser

3.1.2 Method of sending request with request header

requests.get(url, headers=headers)

  • The headers parameter receives the request headers in the form of a dictionary

  • The request header field names are the keys, and the corresponding field values are the values

3.1.3 Complete code implementation

Copy a User-Agent from the browser to construct the headers dictionary; complete the code below, then run it and view the result

import requests

url = 'https://www.baidu.com'

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

# carry a User-Agent in the request headers to simulate a browser request
response = requests.get(url, headers=headers) 

print(response.content)

# print the request headers
print(response.request.headers)
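To see what the headers parameter actually changes without sending any network traffic, we can build and prepare the request locally. This is a sketch using requests' Request/PreparedRequest objects, not part of the tutorial's original flow:

```python
import requests

ua = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36'

# prepare() builds the request that would be sent, without sending it
prepared = requests.Request('GET', 'https://www.baidu.com',
                            headers={'User-Agent': ua}).prepare()

# the custom User-Agent replaces the default 'python-requests/x.y.z'
print(prepared.headers['User-Agent'])
```

Without the headers argument, the server would see a User-Agent like python-requests/2.x, which is why Baidu returns a much smaller page.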

Knowledge point: master the use of the headers parameter


3.2 Sending a request with parameters

When searching with Baidu, we often find a ? in the URL; the request parameters after the question mark are also called the query string

3.2.1 Carrying parameters in the URL

Directly initiate a request to the URL containing parameters

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

url = 'https://www.baidu.com/s?wd=python'

response = requests.get(url, headers=headers)

3.2.2 Carrying parameter dictionary through params

1. Build a dictionary of request parameters

2. Pass the parameter dictionary when sending the request, assigning it to the params argument

import requests

headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"}

# this is the target url
# url = 'https://www.baidu.com/s?wd=python'

# with or without the trailing question mark, the result is the same
url = 'https://www.baidu.com/s?'

# the request parameters are a dictionary, i.e. wd=python
kw = {'wd': 'python'}

# send the request with the parameters and get the response
response = requests.get(url, headers=headers, params=kw)

print(response.content)
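To confirm how params builds the query string, we can again prepare the request locally instead of sending it (a sketch; the URL encoding itself is handled by requests):

```python
import requests

# prepare() exposes the final URL that would be requested, without network access
prepared = requests.Request('GET', 'https://www.baidu.com/s',
                            params={'wd': 'python'}).prepare()

print(prepared.url)  # https://www.baidu.com/s?wd=python
```

requests also percent-encodes special characters in the values, so you can pass Chinese keywords in the params dictionary directly.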

Knowledge point: master the method of sending a request with parameters


 

3.3 Carrying cookies in the headers parameter

Websites often use the Cookie field in the request header to maintain a user's login state. We can add a cookie to the headers parameter to simulate a logged-in user's request. Let's take logging in to GitHub as an example:

3.3.1 Analysis of github login and packet capture

  1. Open the browser, right click → Inspect, click the Network tab, and check Preserve log

  2. Visit the GitHub login URL https://github.com/login

  3. After entering your username and password and clicking Sign in, visit a URL that requires login to return the correct content; for example, click Your profile in the upper right corner to visit https://github.com/USER_NAME

  4. Once the URL is determined, determine the User-Agent and Cookie request header fields required to send the request

3.3.2 Completing the code

  • Copy the User-Agent and Cookie from the browser

  • The request header fields and values in the headers parameter must match those in the browser exactly

  • The value of the Cookie key in the headers dictionary is a string

import requests

url = 'https://github.com/USER_NAME'

# build the request headers dictionary
headers = {
    # User-Agent copied from the browser
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
    # Cookie copied from the browser
    'Cookie': 'xxx paste the cookie string copied from the browser here'
}

# carry the cookie string in the request headers dictionary
resp = requests.get(url, headers=headers)

print(resp.text)

3.3.3 Run code verification results

Search for title in the printed output. If the title text in the HTML is your GitHub username, you have successfully used the headers parameter to carry the cookie and fetch a page that requires login.

Knowledge point: master the cookies carried in headers


 

3.4 Use of cookies parameters

In the previous section we carried cookies in the headers parameter; we can also use the dedicated cookies parameter

  1. The form of the cookies parameter: a dictionary

    cookies = {"cookie name": "cookie value"}

    • The dictionary corresponds to the Cookie string in the request header, in which the pairs are separated by a semicolon and a space

    • The left side of each equals sign is a cookie's name, corresponding to a key of the cookies dictionary

    • The right side of each equals sign corresponds to a value of the cookies dictionary

  2. How to use the cookies parameter

    response = requests.get(url, cookies=cookies)

  3. The dict comprehension needed to convert a cookie string into the cookies parameter dictionary:

    cookies_dict = {cookie.split('=')[0]:cookie.split('=')[-1] for cookie in cookies_str.split('; ')}

  4. Note: cookies generally have an expiration time; once expired, they must be re-acquired
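As a quick illustration of the dict comprehension above, here it is applied to a made-up cookie string in the same 'name=value; name=value' format a browser shows:

```python
# a made-up cookie string in browser format (the real one is copied from DevTools)
cookies_str = 'logged_in=yes; _session=abc123'

# split on '; ' to get each pair, then split each pair on '='
cookies_dict = {cookie.split('=')[0]: cookie.split('=')[-1]
                for cookie in cookies_str.split('; ')}

print(cookies_dict)  # {'logged_in': 'yes', '_session': 'abc123'}
```

Using [-1] for the value (rather than [1]) keeps this robust when a cookie value itself contains an equals sign only if there is exactly one extra '='; for fully general parsing the standard library's http.cookies module is safer.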

import requests

url = 'https://github.com/USER_NAME'

# build the request headers dictionary
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'
}
# build the cookies dictionary
cookies_str = 'the cookies string copied from the browser'

cookies_dict = {cookie.split('=')[0]:cookie.split('=')[-1] for cookie in cookies_str.split('; ')}

# carry the cookies dictionary via the cookies parameter
resp = requests.get(url, headers=headers, cookies=cookies_dict)

print(resp.text)

Knowledge points: master the use of cookies parameters


 

3.5 Method of converting cookieJar object into cookies dictionary

The response object obtained with requests has a cookies attribute. Its value is a cookieJar object containing the cookies set by the server. How do we convert it into a cookies dictionary?

  1. Conversion method

    cookies_dict = requests.utils.dict_from_cookiejar(response.cookies)

  2. Here response.cookies returns an object of type cookieJar

  3. The requests.utils.dict_from_cookiejar function returns a cookies dictionary
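The conversion can be demonstrated without any network access by building a cookieJar locally to stand in for response.cookies (the cookie name and value below are made up):

```python
import requests

# build a cookieJar locally to stand in for response.cookies
jar = requests.cookies.RequestsCookieJar()
jar.set('session_id', 'abc123', domain='example.com', path='/')

# convert the cookieJar into a plain cookies dictionary
cookies_dict = requests.utils.dict_from_cookiejar(jar)
print(cookies_dict)  # {'session_id': 'abc123'}
```

The resulting dictionary can then be passed straight back into requests.get(url, cookies=cookies_dict) on a later request.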


Knowledge points: master the conversion method of cookieJar

That's a lot to take in at once; digest it slowly~ To be continued...

Origin: blog.csdn.net/weixin_45293202/article/details/114295370