Python crawler series, from getting started to giving up: using the requests library

Using requests

Since you're here, let's dive into the ocean of knowledge!

1. Introduction to requests

The key to fetching a web page is simulating a browser sending a request to the server. The third-party requests library gives us a full-featured way to do exactly that, so let's take a look at what it can do!

2. Install the requests library

  • Linux users can install from the command line: pip3 install requests
  • Windows 10 users can also use the command line: pip install requests, or install it through PyCharm.

3. Request methods of the requests library

3.1 requests.get(url,params,headers)

Constructs and sends a Request to the server and returns a Response object. This is the most commonly used method, so be sure to master it. The main attributes of the Response object are as follows:

  • text: the HTTP response body as a string, i.e. the page content at the url
  • content: the HTTP response body in binary form
  • encoding: the response encoding guessed from the HTTP headers
  • status_code: the HTTP status code of the request; 200 means success, 404 means the page was not found
  • apparent_encoding: the encoding inferred from the response body itself (a fallback for encoding)
  • history: the request history
  • headers: the response headers
  • cookies: the cookies recorded in the response
import requests

response = requests.get("http://www.baidu.com")

# the response body as text
print(response.text)
print("+"*70)

# the response body in binary form
print(response.content)
print("+"*70)

# encoding guessed from the response headers
print(response.encoding)
print("+"*70)

# encoding inferred from the response body
print(response.apparent_encoding)
print("+"*70)

# status code, used to check the request status
print(response.status_code)
print("+"*70)

# response headers
print(response.headers)
print("+"*70)

# response cookies
print(response.cookies)
print("+"*70)

# the final URL
print(response.url)
print("+"*70)

# request history
print(response.history)


The url parameter is the URL of the page to crawl; headers is the request header information to send, such as User-Agent, cookies and so on; params carries the query-string parameters (a separate params sketch follows the code below). Sample code:

import requests

url1 = "http://www.baidu.com"

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36"
}
response1 = requests.get(url1, headers=headers)
print(response1.content)
print(response1.text)  # note the difference between the two

# fetch binary data, such as images or videos
url2 = "http://github.com/favicon.ico"
response2 = requests.get(url2, headers=headers)
with open("github.ico", "wb") as p:
    p.write(response2.content)
print("Image download complete")

The downloaded image is the GitHub favicon, saved locally as github.ico.
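As for params, here is a minimal sketch; the keys are only illustrative, and httpbin.org/get simply echoes back whatever query parameters it receives:

import requests

# query parameters are appended to the URL as ?key1=value1&key2=value2
params = {
    "key1": "value1",
    "key2": "value2"
}
response = requests.get("http://www.httpbin.org/get", params=params)
print(response.url)   # the final URL, with the query string attached
print(response.text)  # httpbin echoes the received args back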

3.2 requests.post(url,data,files)

Submits a POST request to a page. The data parameter is the data to submit (usually a dictionary or JSON); files is used to upload files (see the sketch after the result below).

import requests

url='http://www.httpbin.org/post'
data = {
    'name': 'xiaowang',
    'age': '16'
}
response = requests.post(url,data=data)
print(response.text)

In the output, you can see that the submitted data is stored under the form field of the response.
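The files parameter mentioned above handles file uploads. A minimal sketch, assuming a local file named github.ico exists (for example the one downloaded earlier), again using httpbin.org/post, which echoes the upload back under its files field:

import requests

# upload a local file; httpbin echoes it back in the "files" field of the response
files = {"file": open("github.ico", "rb")}
response = requests.post("http://www.httpbin.org/post", files=files)
print(response.text)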

3.3 requests.put()

Submits a PUT request to a page.

3.4 requests.head()

Retrieves the header information of a page (only the response headers, no body).

3.5 requests.patch()

Submits a PATCH (partial modification) request to a page.

3.6 requests.delete()

Submits a DELETE request to a page.
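These methods are used just like get and post. A minimal sketch against httpbin.org, which provides matching test endpoints (the data values here are only illustrative):

import requests

# PUT: httpbin echoes the submitted data back
print(requests.put("http://www.httpbin.org/put", data={"name": "xiaowang"}).status_code)

# HEAD: only the response headers come back, the body is empty
print(requests.head("http://www.httpbin.org/get").headers)

# PATCH: partial modification of a resource
print(requests.patch("http://www.httpbin.org/patch", data={"age": "17"}).status_code)

# DELETE
print(requests.delete("http://www.httpbin.org/delete").status_code)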

4. Exceptions of the requests library

  • requests.HTTPError: the HTTP response status is abnormal (a normal, successful request returns status code 200)
  • requests.Timeout: the request timed out (covers the whole period from sending the request to receiving the content)
  • requests.URLRequired: a URL is missing
  • requests.ConnectTimeout: a timeout that occurred only while connecting to the remote server
  • requests.ConnectionError: a network connection error, such as a DNS lookup failure or a refused connection
  • requests.TooManyRedirects: the maximum number of redirects was exceeded
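As a minimal sketch of how these exceptions might be caught in practice (the URL and timeout value here are only illustrative):

import requests

try:
    response = requests.get("http://www.httpbin.org/get", timeout=5)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx status codes
except requests.Timeout:
    print("the request timed out")
except requests.ConnectionError:
    print("network connection error")
except requests.HTTPError as e:
    print("bad HTTP status:", e)
else:
    print(response.status_code)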

5. Maintaining a session

5.1 Why should I maintain a session?

Normally, if you request a page several times directly with get or a similar method, the server will not recognize these requests as one session; every request starts a new session. When a site requires logging in and you then need to fetch its pages several times, this produces errors (the first request logs in, but the second request is no longer logged in). Maintaining a session solves exactly this. The Session object provided by the requests library lets us keep a session alive easily, without worrying about the details. Let's see what a session does:

import requests

# separate plain requests do not share the same cookies
r1 = requests.get("http://www.httpbin.org/cookies/set/number/11111111111")
r2 = requests.get("http://www.httpbin.org/cookies")
print(r2.text)
print("*"*30)

s = requests.Session()  # create a session object
s.get("http://www.httpbin.org/cookies/set/number/2222222222")   # set a cookie through the session object
response = s.get("http://www.httpbin.org/cookies")   # query the cookies again within the same session
print(response.text)

In the output, the cookie set by the first plain request (r1) is gone when the cookies are queried again with r2, showing the two requests are not the same session; with the Session object, both requests see the same cookies, showing they belong to one session. This is why Session is commonly used for simulated login.

6. SSL certificate verification

Some HTTPS sites use certificates that are not issued by a trusted CA. If the certificate cannot be verified, a certificate verification error occurs.
When we request such a site directly while crawling, an SSLError is thrown, indicating that certificate verification failed. Don't worry, the requests library has already thought of this: it provides the verify parameter, which is True by default (the certificate is verified). We only need to set it to False to skip verification. We will then receive a warning, which can be suppressed by ignoring it. Details below:

import requests
from requests.packages import urllib3

urllib3.disable_warnings()   # suppress the InsecureRequestWarning
response = requests.get("https://www.12306.cn", verify=False)
print(response.status_code)

With this in place, the site can be crawled and visited normally.

7. Proxy settings

7.1 What is a proxy?

During real crawling, the server monitors how frequently a given IP accesses it. Because a crawler requests pages much faster than a human, some sites will show a verification prompt, block the IP outright, or quietly return wrong data. To solve this we need to disguise our IP, and that is what a proxy is for. A proxy is simply a proxy server whose job is to fetch network content on the user's behalf: it acts as an agent that crawls the data for us, which solves the problem.

7.2 Setting up a proxy

The powerful requests library provides the proxies parameter for setting a proxy. proxies is a dictionary whose keys are protocols and whose values are the proxy IP and port. For example (the proxies here are not real):

import requests

proxies = {
    "http": "http://127.0.0.1:1314",
    "https": "http://127.0.0.1:2345"
}
url = "http://www.httpbin.org/get"
response = requests.get(url, proxies=proxies)
print(response.text)

If the proxy server requires a username and password , you can set it as follows:

proxies = {
    "http": "http://username:password@127.0.0.1:1314",
    "https": "http://username:password@127.0.0.1:2345"
}
8. Timeout setting

Sending a request to the server and getting its response takes time, and sometimes the wait can be very long. To avoid hanging when the server does not respond in time, the requests library provides the timeout parameter. If no response is received within the specified time, a timeout exception such as ConnectTimeout is thrown, which we can handle by catching the exception, as sketched after the example below.

import requests

response = requests.get("http://www.baidu.com", timeout=0.01)  # a deliberately tiny timeout, so a timeout exception is raised
print(response)
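A minimal sketch of catching the timeout (requests.Timeout is the common parent of ConnectTimeout and ReadTimeout):

import requests

try:
    response = requests.get("http://www.baidu.com", timeout=0.01)
    print(response.status_code)
except requests.Timeout:
    # the deliberately tiny timeout makes this branch fire
    print("the request timed out")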


9. Identity Verification

Some sites require authentication when we access them, and requesting such a page directly fails. The auth parameter of the requests library handles this login: auth is a tuple whose first element is the username and whose second element is the password.

import requests

auth = ("username", "password")  # a tuple of username and password
response = requests.get("http://www.baidu.com", auth=auth)  # replace this URL with one that actually requires authentication
print(response.status_code)
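httpbin.org also exposes a test endpoint for HTTP basic auth, so a sketch that actually exercises the parameter might look like this (the username and password are whatever you put in the endpoint URL):

import requests

# httpbin's /basic-auth/<user>/<passwd> endpoint accepts exactly that user/password pair
response = requests.get("http://www.httpbin.org/basic-auth/xiaowang/123456",
                        auth=("xiaowang", "123456"))
print(response.status_code)  # 200 if the credentials match, 401 otherwise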

10. The Request object

We can represent a request as a **data structure**, carrying all of its parameters in a single Request object. This lets us treat the request as an **independent object**, which is very convenient later on for **queue scheduling**. An example:
import requests


url = "http://www.httpbin.org/post"
data = {
    "name": "xiaoming",
    "age": "18"
}
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36"
}

s = requests.Session()
request = requests.Request("POST", url, data=data, headers=headers)

prepared = s.prepare_request(request)
response = s.send(prepared)
print(response.text)

From now on, I will publish crawler articles from time to time. The content may fall a little short, so please feel free to offer your advice and bear with me!

