Continuing from the previous article: learn Python web crawlers with An Xian, using the requests module (Part 1)
Table of Contents
3.6 Using the timeout parameter
3.7 Understanding proxies and using the proxies parameter
3.8 Using the verify parameter to ignore the CA certificate
4. Sending POST requests with the requests module
4.1 How to send a POST request
5. Using requests.session to maintain state
5.1 The role of requests.session and its application scenarios
5.2 How to use requests.session
3.6 Using the timeout parameter
When browsing the web we often run into network fluctuations; a request may wait a long time and still get no result.
In a crawler, a request that hangs for a long time drags down the efficiency of the whole project. We therefore need to force the request to return a result within a given time, or raise an error.
- How to use the timeout parameter:
  response = requests.get(url, timeout=3)
- timeout=3 means: a response must be returned within 3 seconds of sending the request, otherwise an exception is raised
import requests
url = 'https://twitter.com'
response = requests.get(url, timeout=3)  # set the timeout
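When the timeout expires, requests raises requests.exceptions.Timeout. In a real crawler you usually catch it so one slow URL does not crash the whole run; a minimal sketch (the helper name fetch_with_timeout and the fallback behavior are ours):

```python
import requests

def fetch_with_timeout(url, timeout=3):
    """Return the response text, or None if the request fails or times out."""
    try:
        response = requests.get(url, timeout=timeout)
        return response.text
    except requests.exceptions.Timeout:
        # raised when no response arrives within `timeout` seconds
        return None
    except requests.exceptions.RequestException:
        # any other requests error (DNS failure, refused connection, ...)
        return None
```

The caller can then treat None as "skip this URL" instead of handling exceptions at every call site.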
Knowledge point: master the use of the timeout parameter
3.7 Understanding proxies and using the proxies parameter
The proxies parameter specifies proxy IPs, so that the forward proxy server behind each proxy IP forwards the requests we send. Let's first understand proxy IPs and proxy servers.
3.7.1 Understanding the process of using a proxy
- A proxy IP is an IP address that points to a proxy server
- The proxy server forwards our requests to the target server
3.7.2 The difference between forward proxy and reverse proxy
As mentioned above, the proxy IP specified by the proxies parameter points to a forward proxy server; correspondingly, there are also reverse proxy servers. Let's look at the difference between the two.
- Forward and reverse proxies are distinguished from the perspective of the party sending the request:
  - a proxy that forwards requests on behalf of the browser or client (the party sending the request) is called a forward proxy
    - the browser knows the real IP address of the server that ultimately handles the request, e.g. a VPN
  - a proxy that forwards requests not for the browser or client, but for the server that ultimately handles the request, is called a reverse proxy
    - the browser does not know the real address of the server, e.g. nginx
3.7.3 Classification of proxy IPs (proxy servers)
- By anonymity, proxy IPs fall into the following three categories:
  - Transparent proxy: a transparent proxy "hides" your IP address, but the target server can still find out who you are. The request headers it receives look like this:
    REMOTE_ADDR = proxy IP
    HTTP_VIA = proxy IP
    HTTP_X_FORWARDED_FOR = your IP
  - Anonymous proxy: with an anonymous proxy, others can only tell that you are using a proxy; they cannot tell who you are. The request headers received by the target server look like this:
    REMOTE_ADDR = proxy IP
    HTTP_VIA = proxy IP
    HTTP_X_FORWARDED_FOR = proxy IP
  - High anonymity proxy (elite proxy): a high anonymity proxy prevents others from even discovering that you are using a proxy, so it works best and is the best choice. The request headers received by the target server look like this:
    REMOTE_ADDR = proxy IP
    HTTP_VIA = not determined
    HTTP_X_FORWARDED_FOR = not determined
- Depending on the protocol used by the target website, you need a proxy service for the matching protocol. By the protocol of the proxied request, proxies can be divided into:
  - HTTP proxy: the target URL uses the HTTP protocol
  - HTTPS proxy: the target URL uses the HTTPS protocol
  - SOCKS tunnel proxy (e.g. SOCKS5):
    - a SOCKS proxy simply relays data packets and does not care about the application protocol (FTP, HTTP, HTTPS, etc.)
    - a SOCKS proxy takes less time than an HTTP or HTTPS proxy
    - a SOCKS proxy can forward both HTTP and HTTPS requests
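As a side note, requests can use SOCKS proxies only when the optional dependency is installed (pip install requests[socks], which pulls in PySocks). The address below is a hypothetical local SOCKS5 proxy, not a real service:

```python
# requires: pip install requests[socks]
# hypothetical local SOCKS5 proxy listening on port 1080
proxies = {
    "http": "socks5://127.0.0.1:1080",
    "https": "socks5://127.0.0.1:1080",
}
# with the dependency installed, this dict is passed exactly like an
# http/https proxies dict:
# response = requests.get("https://example.com", proxies=proxies)
```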
3.7.4 Using the proxies parameter
To keep the server from realizing that all requests come from the same client, and to avoid being blocked for sending frequent requests to one domain, we need to use proxy IPs. Let's learn how the requests module uses them.
- Usage:
  response = requests.get(url, proxies=proxies)
- proxies takes the form of a dictionary, e.g.:
  proxies = {
      "http": "http://12.34.56.79:9527",
      "https": "https://12.34.56.79:9527",
  }
- Note: if the proxies dictionary contains multiple key-value pairs, the proxy IP is chosen according to the protocol of the URL address when the request is sent
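That scheme-based selection can be checked without sending any traffic, using requests' internal helper requests.utils.select_proxy (an internal utility, so its exact location may vary between versions; the proxy addresses are the placeholders from the example above):

```python
import requests.utils

# placeholder proxy addresses, matching the example above
proxies = {
    "http": "http://12.34.56.79:9527",
    "https": "https://12.34.56.79:9527",
}

# requests picks the proxy whose key matches the scheme of the URL
print(requests.utils.select_proxy("http://example.com/", proxies))
print(requests.utils.select_proxy("https://example.com/", proxies))
```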
Knowledge point: master the use of the proxies parameter
3.8 Using the verify parameter to ignore the CA certificate
When browsing the web, you sometimes see a prompt like the following (e.g. on the 12306 website before October 2018):
- Reason: the CA certificate of this website has not been authenticated by a trusted root certification authority
- We will not expand on CA certificates and trusted root certification authorities here; look them up if you are interested
3.8.1 Run the code to see the effect of requesting an insecure link
Running the following code throws an exception containing the words ssl.CertificateError:

import requests
url = "https://sam.huat.edu.cn:8443/selfservice/"
response = requests.get(url)
3.8.2 Solution
To make the request work in code, we pass verify=False. The requests module then sends the request without verifying the CA certificate; in other words, the verify parameter lets us skip CA-certificate verification:

import requests
url = "https://sam.huat.edu.cn:8443/selfservice/"
response = requests.get(url, verify=False)
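Note that with verify=False, urllib3 (the library underneath requests) emits an InsecureRequestWarning on every call; if you have deliberately chosen to skip verification, you can silence it. A sketch, with a hypothetical helper name fetch_insecure:

```python
import requests
import urllib3

# verify=False makes urllib3 warn on every request; silence it explicitly
# so the crawler's logs stay readable
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

def fetch_insecure(url):
    # skip CA-certificate verification -- only for hosts you already trust
    return requests.get(url, verify=False, timeout=5)
```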
Knowledge point: master the use of the verify parameter to ignore the CA certificate
4. Sending POST requests with the requests module
Think about it: where do we run into POST requests?
- Login and registration (in the eyes of web engineers, POST is more secure than GET, since the URL does not expose the user's account, password, or other information)
- When large amounts of text must be transmitted (a POST request places no limit on the length of the data)
By the same token, our crawler needs to simulate the browser and send POST requests in these two situations.
4.1 How to send a POST request
- response = requests.post(url, data)
- The data parameter receives a dictionary
- The other parameters of the requests module's POST function are exactly the same as those of the GET request
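The dictionary passed through data is sent as a form-encoded request body; you can preview that encoding with the standard library (the field values below are illustrative):

```python
from urllib.parse import urlencode

# the same form fields requests.post(url, data=...) would send
data = {"f": "auto", "t": "auto", "w": "hello"}
body = urlencode(data)
print(body)  # f=auto&t=auto&w=hello
```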
4.2 POST request exercise
Let's see how to use a POST request through the example of the Jinshan translation site:
- Address: http://fy.iciba.com/
Thinking through the analysis:
1. Capture packets to determine the requested URL address
2. Determine the request parameters
3. Determine the location of the returned data
4. Simulate the browser to get the data
4.2.3 Conclusion of packet capture analysis
- URL address: http://fy.iciba.com/
- Request method: POST
- Request parameters:
  data = {
      'f': 'auto',  # the source language is detected automatically
      't': 'auto',  # the target language is detected automatically
      'w': '人生苦短'  # the Chinese string to be translated
  }
- Desktop (PC) User-Agent:
  Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36
4.2.4 Code implementation
Now that we understand how the requests module sends POST requests and have analyzed the Jinshan translation site, let's complete the code.
import requests
import json

class King(object):
    def __init__(self, word):
        self.url = "http://fy.iciba.com/ajax.php?a=fy"
        self.word = word
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
        }
        self.post_data = {
            "f": "auto",
            "t": "auto",
            "w": self.word
        }

    def get_data(self):
        response = requests.post(self.url, headers=self.headers, data=self.post_data)
        # return bytes by default; decode only if the caller is known to need str
        return response.content

    def parse_data(self, data):
        # convert the json data into a python dict
        dict_data = json.loads(data)
        # extract the translation result from the dict
        try:
            print(dict_data['content']['out'])
        except KeyError:
            print(dict_data['content']['word_mean'][0])

    def run(self):
        # build the url, headers and post_data, send the request, then parse
        data = self.get_data()
        self.parse_data(data)

if __name__ == '__main__':
    # king = King("人生苦短,及时行乐")
    king = King("China")
    king.run()

# the python standard library has many useful modules; read up on one each day
Knowledge point: master sending POST requests with the requests module
5. Using requests.session to maintain state
The Session class in the requests module can automatically handle the cookies produced while sending requests and receiving responses, thereby maintaining state. Let's learn how to use it.
5.1 The role of requests.session and its application scenarios
- The role of requests.session:
  - it automatically handles cookies, i.e. the next request carries the cookies from the previous one
- Application scenarios of requests.session:
  - automatically handling the cookies produced over multiple consecutive requests
5.2 How to use requests.session
After a session instance requests a website, the cookies set by the remote server are stored in the session; the next time the session is used to request that server, the previous cookies are carried along.
session = requests.session()  # instantiate the session object
response = session.get(url, headers, ...)
response = session.post(url, data, ...)
- The parameters of GET or POST requests sent through the session object are exactly the same as those sent through the requests module
5.3 In-class exercise
Use requests.session to log in to github, and fetch a page that can only be accessed after logging in.
5.3.1 Tips
- Capture packets for the whole flow of logging in to github and then visiting a page that requires login
- Determine the URL address, request method, and required request parameters of the login request
  - some request parameters appear in the response body of another URL and can be extracted with the re module
- Determine the URL address and request method of the page that can be accessed after logging in
- Complete the code using requests.session
5.3.2 Reference code
import requests
import re

# build the request headers dict
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
}

# instantiate the session object
session = requests.session()

# visit the login page to get the parameters the login request needs
response = session.get('https://github.com/login', headers=headers)
authenticity_token = re.search('name="authenticity_token" value="(.*?)" />', response.text).group(1)  # extract the token with a regex

# build the login request's parameter dict
data = {
    'commit': 'Sign in',  # fixed value
    'utf8': '✓',  # fixed value
    'authenticity_token': authenticity_token,  # found in the login page's response body
    'login': input('enter your github username: '),
    'password': input('enter your github password: ')
}

# send the login request (its response does not matter here)
session.post('https://github.com/session', headers=headers, data=data)

# print a page that can only be accessed after logging in
response = session.get('https://github.com/1596930226', headers=headers)
print(response.text)
Knowledge point: master the use of requests.session to maintain state