Learning Python Web Crawlers with An Xian: Using the requests Module (Part 2)

Continuing from the previous article: Learning Python Web Crawlers with An Xian: Using the requests Module (Part 1)

Table of Contents

3.6 The use of timeout parameter timeout

3.7 Understand the proxy and the use of proxy proxy parameters

3.8 Use the verify parameter to ignore the CA certificate

4. The requests module sends a post request

4.1 Request method to send post request

4.2 POST request exercise

5. Use requests.session to maintain state

5.1 The role of requests.session and application scenarios

5.2 How to use requests.session

5.3 Class test


3.6 The use of timeout parameter timeout

When surfing the Internet we often run into network fluctuations; a request may wait a long time and still get no result.

In a crawler, a request that hangs for a long time drags down the efficiency of the whole project. We therefore need to force the request to return a result within a specific time, or raise an error otherwise.

  1. How to use the timeout parameter timeout

    response = requests.get(url, timeout=3)

  2. timeout=3 means: the response must come back within 3 seconds of sending the request, otherwise an exception is raised

import requests


url = 'https://twitter.com'
response = requests.get(url, timeout=3)     # set the timeout
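When the deadline is exceeded, requests raises requests.exceptions.Timeout (more precisely ConnectTimeout or ReadTimeout). A minimal sketch of catching it; fetch_with_timeout is a hypothetical helper name, not part of the requests API:

```python
import requests


def fetch_with_timeout(url, timeout=3):
    """Return the response, or None if the request times out.

    (fetch_with_timeout is a hypothetical helper, for illustration only.)
    """
    try:
        return requests.get(url, timeout=timeout)
    except requests.exceptions.Timeout:
        return None
```

Catching the exception lets the crawler skip a slow page and move on instead of crashing.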

Knowledge point: master the use of the timeout parameter timeout


 

3.7 Understand the proxy and the use of proxy proxy parameters

The proxies parameter specifies proxy IPs so that the forward proxy server corresponding to each proxy IP forwards the requests we send. So let's first understand proxy IPs and proxy servers.

3.7.1 Understanding the process of using a proxy

  1. The proxy ip is an ip, which points to a proxy server

  2. The proxy server can help us forward the request to the target server

3.7.2 The difference between forward proxy and reverse proxy

As mentioned earlier, the proxy IP specified by the proxies parameter points to a forward proxy server, whose counterpart is the reverse proxy server. Let's now look at the difference between a forward proxy server and a reverse proxy server.

  1. Forward and reverse proxies are distinguished from the perspective of the party sending the request

  2. A proxy that forwards requests on behalf of the browser or client (the party sending the request) is called a forward proxy

    • The browser knows the real IP address of the server that ultimately handles the request, e.g. a VPN

  3. A proxy that forwards requests not on behalf of the browser or client, but on behalf of the server that ultimately handles the request, is called a reverse proxy

    • The browser does not know the real address of the server, e.g. nginx

3.7.3 Classification of proxy ip (proxy server)

  1. According to the anonymity of the proxy IP, the proxy IP can be divided into the following three categories:

    • Transparent Proxy: although your requests pass through the proxy, a transparent proxy does not really hide your IP address; the target server can still find out exactly who you are. The request headers received by the target server are as follows:

      REMOTE_ADDR = Proxy IP
      HTTP_VIA = Proxy IP
      HTTP_X_FORWARDED_FOR = Your IP
    • Anonymous Proxy: With an anonymous proxy, others can only know that you use a proxy, but cannot know who you are. The request header received by the target server is as follows:

      REMOTE_ADDR = proxy IP
      HTTP_VIA = proxy IP
      HTTP_X_FORWARDED_FOR = proxy IP
    • High Anonymity Proxy (Elite Proxy): a high anonymity proxy prevents others from discovering that you are using a proxy at all, so it is the best choice. The request headers received by the target server are as follows:

      REMOTE_ADDR = Proxy IP
      HTTP_VIA = not determined
      HTTP_X_FORWARDED_FOR = not determined
  2. Depending on the protocol used by the target website, you need a proxy service for the matching protocol. By the protocol the proxy service handles, proxies can be divided into:

    • http proxy: the target url is http protocol

    • https proxy: the target url is https protocol

    • socks tunnel proxy (e.g. socks5 proxy), etc.:

      1. A socks proxy simply transmits data packets and does not care about the application protocol (FTP, HTTP, HTTPS, etc.)

      2. A socks proxy takes less time than an http or https proxy

      3. A socks proxy can forward both http and https requests

3.7.4 Use of proxies proxy parameters

To make the server believe that the requests are not all coming from the same client, and to avoid being blocked for sending frequent requests to one domain name, we need to use proxy IPs. So let's learn how the requests module uses a proxy IP.

  • usage:

    response = requests.get(url, proxies=proxies)
  • The proxies parameter takes a dictionary

  • For example:

    proxies = { 
        "http": "http://12.34.56.79:9527", 
        "https": "https://12.34.56.79:9527", 
    }
  • Note: if the proxies dictionary contains multiple key-value pairs, the proxy IP matching the protocol of the url address is selected when the request is sent
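How a proxy is picked from the dictionary can be sketched offline: the scheme of the url is matched against the keys of the proxies dict. This is a simplified illustration of the selection rule (it ignores per-host exceptions such as NO_PROXY), and the proxy addresses are placeholders:

```python
from urllib.parse import urlparse

# hypothetical proxy addresses, for illustration only
proxies = {
    "http": "http://12.34.56.79:9527",
    "https": "https://12.34.56.79:9527",
}


def pick_proxy(url, proxies):
    """Return the proxy whose key matches the url's scheme."""
    scheme = urlparse(url).scheme
    return proxies.get(scheme)


print(pick_proxy("https://example.com", proxies))  # https://12.34.56.79:9527
```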


Knowledge points: master the use of proxy ip parameters proxies

3.8 Use the verify parameter to ignore the CA certificate

When browsing the web you sometimes see a certificate warning like the following (the 12306 website before October 2018 is a well-known example):

3.8.1 Run the code to view the effect of initiating a request to an insecure link in the code

Running the following code raises an exception whose message contains the words ssl.CertificateError:

import requests
url = "https://sam.huat.edu.cn:8443/selfservice/"
response = requests.get(url)

3.8.2 Solution

To make the request succeed in code, we pass the verify=False parameter. The requests module then sends the request without verifying the CA certificate: the verify parameter can skip CA certificate verification.

import requests
url = "https://sam.huat.edu.cn:8443/selfservice/" 
response = requests.get(url,verify=False)
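Note that with verify=False the requests module emits an InsecureRequestWarning on every request; if that noise is unwanted it can be silenced once via urllib3. A sketch (the actual network call is left commented out):

```python
import requests
import urllib3

# verify=False triggers an InsecureRequestWarning per request;
# disable it once at startup if the risk is understood
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

url = "https://sam.huat.edu.cn:8443/selfservice/"
# response = requests.get(url, verify=False)  # network call, left commented out
```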

Knowledge points: master the use of verify parameters to ignore CA certificates


 

4. The requests module sends a post request

Thinking: Where do we use POST requests?

  1. Login and registration (in the eyes of web engineers, POST is more secure than GET, since the url address does not expose the user's account, password, and other information)

  2. When large amounts of text need to be transmitted (a POST request has no limit on data length)

So, likewise, our crawler needs to simulate the browser and send POST requests in these two situations.

4.1 Request method to send post request

  • response = requests.post(url, data)

  • The data parameter receives a dictionary

  • The other parameters of requests.post for sending a POST request are exactly the same as those of the GET request
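What requests.post does with the data dictionary can be inspected offline using a prepared request: the dict is form-encoded into the request body and the Content-Type header is set automatically. The url and form fields below are placeholders for illustration:

```python
import requests

# placeholder url and form fields, for illustration only
req = requests.Request('POST', 'http://example.com/login',
                       data={'user': 'alice', 'pwd': 'secret'})
prepared = req.prepare()

print(prepared.body)                     # user=alice&pwd=secret
print(prepared.headers['Content-Type'])  # application/x-www-form-urlencoded
```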

4.2 POST request exercise

Let's take a look at how to use the post request through the example of Jinshan translation:

  1. Address: http://fy.iciba.com/

Thinking analysis

  1. Capture packets to determine the requested url address

  2. Determine the request parameters

  3. Determine the location of the returned data

  4. Simulate the browser to get the data

4.2.3 Conclusion of packet capture analysis

  1. URL address:http://fy.iciba.com/

  2. Request method: POST

  3. Request parameters:

    data = {
        'f':'auto', # meaning the source language is auto-detected
        't':'auto', # meaning the target language is auto-detected
        'w':'人生苦短' # the Chinese string to be translated
    }
  4. PC User-Agent:

    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36

4.2.4 Code implementation

Now that we understand how the requests module sends POST requests, and having analysed the Kingsoft (iciba) translation interface, let's complete the code.

import requests
import json


class King(object):

    def __init__(self, word):
        self.url = "http://fy.iciba.com/ajax.php?a=fy"
        self.word = word
        self.headers = {
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
        }
        self.post_data = {
            "f": "auto",
            "t": "auto",
            "w": self.word
        }

    def get_data(self):
        response = requests.post(self.url, headers=self.headers, data=self.post_data)
        # return bytes by default, decoding only when the caller definitely needs str
        return response.content

    def parse_data(self, data):

        # convert the json data into a python dict
        dict_data = json.loads(data)

        # extract the translation result from the dict
        try:
            print(dict_data['content']['out'])
        except KeyError:
            print(dict_data['content']['word_mean'][0])

    def run(self):
        # url
        # headers
        # post_data
        # send the request
        data = self.get_data()
        # parse the response
        self.parse_data(data)

if __name__ == '__main__':
    # king = King("人生苦短,及时行乐")
    king = King("China")
    king.run()
    # the python standard library has many useful modules; look at one standard library module each day
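The parsing logic above can be exercised offline with canned payloads shaped like the packet-capture analysis (the field names content.out and content.word_mean come from the tutorial's analysis, not from verifying the live API):

```python
import json


def extract_result(raw):
    """Pull the translation out of an iciba-style payload;
    fall back to word_mean for single-word lookups.
    (Payload shape assumed from the packet-capture analysis above.)"""
    dict_data = json.loads(raw)
    try:
        return dict_data['content']['out']
    except KeyError:
        return dict_data['content']['word_mean'][0]


print(extract_result('{"content": {"out": "Life is short"}}'))  # Life is short
print(extract_result('{"content": {"word_mean": ["中国"]}}'))    # 中国
```

Using KeyError instead of a bare except keeps genuine errors (e.g. malformed JSON) visible.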

Knowledge point: master the requests module to send post requests


 

5. Use requests.session to maintain state

The Session class in the requests module can automatically handle the cookies produced while sending requests and receiving responses, thereby maintaining state. Let's learn it next.

5.1 The role of requests.session and application scenarios

  • The role of requests.session

    • Automatically processes cookies, i.e. the next request carries the cookies set by the previous one

  • Application scenarios of requests.session

    • Automatically handling cookies across multiple consecutive requests

5.2 How to use requests.session

After a session instance requests a website, the cookies set locally by the remote server are stored in the session; the next time that session is used to request the same server, the previous cookies are sent along.

session = requests.session() # instantiate the session object
response = session.get(url, headers=headers, ...)
response = session.post(url, headers=headers, data=data, ...)
  • The parameters of the get or post request sent by the session object are exactly the same as the parameters sent by the requests module
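Cookie persistence can be observed offline by putting a cookie into the session's jar and preparing a request through the session; the cookie name, value, and url below are made up for the demonstration:

```python
import requests

session = requests.Session()
# simulate a Set-Cookie received from an earlier response
session.cookies.set('sessionid', 'abc123')

# any later request prepared through this session carries the cookie automatically
req = requests.Request('GET', 'http://example.com/profile')
prepared = session.prepare_request(req)
print(prepared.headers.get('Cookie'))  # sessionid=abc123
```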

5.3 Class test

Use requests.session to complete a github login and fetch a page that can only be accessed after logging in.

5.3.1 Tips

  1. Capture packets for the whole flow of logging in to github and then visiting a page that can only be accessed after login

  2. Determine the url address, request method, and required request parameters of the login request

    • Some request parameters appear in the response content of other urls and can be extracted with the re module

  3. Determine the url address and request method of the page that can only be accessed after login

  4. Use requests.session to complete the code

5.3.2 Reference code

import requests
import re


# build the request headers dict
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
}

# instantiate the session object
session = requests.session()

# visit the login page to get the parameters required by the login request
response = session.get('https://github.com/login', headers=headers)
authenticity_token = re.search('name="authenticity_token" value="(.*?)" />', response.text).group(1) # use a regex to extract the parameter required by the login request

# build the login request parameter dict
data = {
    'commit': 'Sign in', # fixed value
    'utf8': '✓', # fixed value
    'authenticity_token': authenticity_token, # this parameter comes from the response content of the login page
    'login': input('Enter github username: '),
    'password': input('Enter github password: ')
}

# send the login request (no need to inspect this response)
session.post('https://github.com/session', headers=headers, data=data)

# print a page that can only be accessed after login
response = session.get('https://github.com/1596930226', headers=headers)
print(response.text)
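The regular expression used above to pull authenticity_token out of the login page can be checked offline against a minimal snippet (the html below is a hypothetical fragment, not the real github page):

```python
import re

# hypothetical fragment of a login page containing the hidden token field
html = '<input name="authenticity_token" value="tok123" />'
token = re.search('name="authenticity_token" value="(.*?)" />', html).group(1)
print(token)  # tok123
```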

Knowledge point: master the use of requests.session to maintain state


Origin blog.csdn.net/weixin_45293202/article/details/114402835