Re-encapsulating requests: building a general-purpose request function


This article shows how to re-encapsulate the requests module; it does not cover the HTTP protocol, URLs, and so on for now. Even if you already use requests, reading this should still give you something new. I can't wait to share this little trick, so let's dive straight into the details.

The official documentation describes it as: "Requests is the only Non-GMO HTTP library for Python, safe for human consumption."

Anyone who writes crawlers in Python has used the requests module, and every new crawler ends up writing the same request boilerplate N times. You have that problem, and so do I. Recently I started thinking about how to encapsulate requests once, so that the wrapper supports the common needs and can simply be called whenever it is required.

So here is the question: if you want to write a general request function for your own use, several problems have to be solved

  • Supporting the different request methods in requests (GET, POST, PUT, etc.)
  • Automatically detecting the site's encoding to avoid mojibake (garbled text)
  • Supporting both text and binary responses (images, videos, etc. are binary content)
  • Being foolproof about the UA (UA: User-Agent; most websites check the User-Agent of an incoming request as a first pass at telling crawlers from real users), so a sensible default should be filled in for us

So let's tackle the above problems

Installation of Requests

Once your Python environment is set up, install it with pip or conda; the commands are as follows:

pip install requests
conda install requests

# Or, if the download is too slow, use a pip mirror inside China, for example:
pip install requests -i  https://pypi.tuna.tsinghua.edu.cn/simple/

After the installation completes, quickly verify that the module imports correctly.
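For example (the exact version number will of course depend on what pip installed):

import requests

print(requests.__version__)   # e.g. 2.24.0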

A first look at the basic use of requests

One of the most common requests in HTTP is the GET request. Let's take a closer look at how to build a GET request using the requests library.

import requests

response = requests.get('http://httpbin.org/get')

# Response status code
print("response.status_code:", response.status_code)
# Response headers
print("response.headers:", response.headers)
# Headers of the request that produced this response
print("response.request.headers:", response.request.headers)
# Response body as bytes
print("response.content:", response.content)
# Response body as text
print("response.text", response.text)

# Output:
response.status_code: 200
response.headers: {'Date': 'Thu, 12 Nov 2020 13:38:05 GMT', 'Content-Type': 'application/json', 'Content-Length': '306', 'Connection': 'keep-alive', 'Server': 'gunicorn/19.9.0', 'Access-Control-Allow-Origin': '*', 'Access-Control-Allow-Credentials': 'true'}
response.request.headers: {'User-Agent': 'python-requests/2.24.0', 'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*', 'Connection': 'keep-alive'}
response.content: b'{\n  "args": {}, \n  "headers": {\n    "Accept": "*/*", \n    "Accept-Encoding": "gzip, deflate", \n    "Host": "httpbin.org", \n    "User-Agent": "python-requests/2.24.0", \n    "X-Amzn-Trace-Id": "Root=1-5fad3abd-7516d60b3e951824687a50d8"\n  }, \n  "origin": "116.162.2.166", \n  "url": "http://httpbin.org/get"\n}\n'
response.text {
  "args": {},
  "headers": {
    "Accept": "*/*",
    "Accept-Encoding": "gzip, deflate",
    "Host": "httpbin.org",
    "User-Agent": "python-requests/2.24.0",
    "X-Amzn-Trace-Id": "Root=1-5fad3abd-7516d60b3e951824687a50d8"
  },
  "origin": "116.162.2.166",
  "url": "http://httpbin.org/get"
}

That covers a quick test of the basic use of requests. Getting a feel for it? Next, let's wrap it in a function so it can be called at any time.

The example is as follows

import requests

urls = 'http://httpbin.org/get'


def downloader(url, headers=None):
    response = requests.get(url, headers=headers)
    return response


print("downloader.status_code:", downloader(url=urls).status_code)
print("downloader.headers:", downloader(url=urls).headers)
print("downloader.request.headers:", downloader(url=urls).request.headers)
print("downloader.content:", downloader(url=urls).content)
print("downloader.text", downloader(url=urls).text)

# The output is the same as shown above, so it is omitted here

Above, we wrapped the request logic in a function, exposing the URL and headers as parameters. Now we only need to pass in the URL we want to request to fire off a request. At this point we have a simple, reusable request function.

End~~~

That takes care of the beginner part. Now let's get to the real thing.

Secondary encapsulation

Encapsulation of request function

Since the request method is not fixed (it could be GET or POST), the function cannot magically know which method the caller wants to send the request with.

The requests.request method already supports this: it takes the HTTP method as a parameter. We expose that parameter in our function, written as follows:

import requests

urls = 'http://httpbin.org/get'


def downloader(url, method=None, headers=None):
    # Default to GET when no method is given
    _method = "GET" if not method else method
    # requests.request expects the HTTP method and the URL
    response = requests.request(method=_method, url=url, headers=headers)
    return response


print("downloader.status_code:", downloader(url=urls).status_code)
print("downloader.headers:", downloader(url=urls).headers)
print("downloader.request.headers:", downloader(url=urls).request.headers)
print("downloader.content:", downloader(url=urls).content)
print("downloader.text", downloader(url=urls).text)

Since most requests use GET, we made GET the default request method. If you need another method, just pass it in when you call the function. For example, we can do

downloader(urls, method="POST")

Text encoding problem

Next, let's fix the mojibake that appears when the response encoding is mis-detected and the text is decoded with the wrong codec.

The cause is usually one of two things: the information in the response headers (charset / Accept-Encoding), or requests simply mis-detecting the encoding.

# Check which encoding requests chose for the response
response.encoding
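To see the mismatch in action, compare the encoding requests chose with its content-based guess (a sketch; the URL is a placeholder for any page served without a charset header):

import requests

response = requests.get('http://some-gbk-site.example')   # hypothetical page with no charset header
print(response.encoding)            # often falls back to 'ISO-8859-1' when the header gives no charset
print(response.apparent_encoding)   # requests' content-based guess, e.g. 'GB2312'
print(response.text[:50])           # decoded with response.encoding, so it can come out as mojibake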

To fix this, we borrow cchardet, an encoding-detection package for Python implemented in C, to detect the encoding of the response body. Install it first:

pip install cchardet -i  https://pypi.tuna.tsinghua.edu.cn/simple/
# If installing with pip directly fails, just use the Tsinghua mirror as above.
# The smart decoding then looks like this:
encoding = cchardet.detect(response.content)['encoding']



import cchardet
import requests


def downloader(url, method=None, headers=None):
    _method = "GET" if not method else method
    response = requests.request(method=_method, url=url, headers=headers)
    # Detect the encoding from the raw bytes instead of trusting the response headers
    encoding = cchardet.detect(response.content)['encoding']
    return response.content.decode(encoding)
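For reference, cchardet.detect returns a dict with the detected encoding and a confidence score, so decoding with its result avoids mojibake (a small self-contained check):

import cchardet

raw = '人生苦短,我用Python'.encode('gbk')     # some GBK-encoded bytes
result = cchardet.detect(raw)
print(result)                                 # e.g. {'encoding': 'GB18030', 'confidence': 0.99}
print(raw.decode(result['encoding']))         # decodes back to the original text, no mojibake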

Distinguish between binary and text parsing

When downloading images, videos, and the like, you need the raw binary content, while downloaded web page text needs to be decoded.

In the same way, we only need to pass in a flag to distinguish the two cases. For example like this:

import cchardet
import requests


def downloader(url, method=None, headers=None, binary=False):
    _method = "GET" if not method else method
    response = requests.request(method=_method, url=url, headers=headers)
    encoding = cchardet.detect(response.content)['encoding']
    # binary=True returns raw bytes; otherwise decode to text with the detected encoding
    return response.content if binary else response.content.decode(encoding)
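Usage then looks like this (a sketch, assuming the downloader above is in scope; httpbin.org serves a sample PNG at /image/png):

html = downloader('http://httpbin.org/get')                      # decoded text
img = downloader('http://httpbin.org/image/png', binary=True)    # raw bytes
with open('demo.png', 'wb') as f:
    f.write(img)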

Default Ua

In many cases we grab a UA, copy it over, and build the key-value pair by hand with quotes, just to run a quick test with requests. That is tedious. And if we send too many requests, the IP gets banned outright; without an IP proxy of my own, crawling starts to feel impossible when there is no budget.

To reduce the probability of getting the IP banned, we add our own UA pool. The principle is very simple: internally, a random UA is used for each request, which lowers the probability of being detected. As for why this works, I will only sketch it briefly here; the details would have to start from computer-networking fundamentals.

The short version: most of your company goes out through the same public IP to reach the target website, so there may be N people hitting that site from one IP. Bans are generally based on the access frequency per IP combined with the browser fingerprint and whatever other signals the site uses. Put simply, once the UA + IP access frequency hits the site's threshold, your IP gets shut out.

So we build our own UA pool and use it to fill in a default request header.

There are a lot of UA strings, so I will not paste them all here; if you are interested, grab them straight from the source code. The principle: collect many UAs and pick one at random for each request, which lowers the frequency of any single identity; a minimal sketch of such a helper is shown below. At the same time, the function still exposes a parameter so you can pass in your own headers.
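Here is what powerspider.tools.Ua.ua could look like (only a sketch; the real pool in the repository is much larger, and these UA strings are just examples):

import random

UA_POOL = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/14.0 Safari/605.1.15',
    'Mozilla/5.0 (X11; Linux x86_64; rv:82.0) Gecko/20100101 Firefox/82.0',
]


def ua():
    # Return a random User-Agent string from the pool on every call
    return random.choice(UA_POOL)

With that in place, the downloader picks a random UA whenever the caller does not supply headers: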

import cchardet
import requests

from powerspider.tools.Ua import ua


def downloader(url, method=None, header=None, binary=False):
    # Use the caller's headers if given, otherwise fall back to a random UA from the pool
    _headers = header if header else {'User-Agent': ua()}
    _method = "GET" if not method else method
    response = requests.request(method=_method, url=url, headers=_headers)
    encoding = cchardet.detect(response.content)['encoding']
    return response.content if binary else response.content.decode(encoding)

The basics are now in place, but it is still not complete: there is no exception handling, no retry on error, no logging. That will not do; since we are doing the job, let's do it properly.

So let's add those in.

import cchardet
from retrying import retry
from powerspider import logger
from powerspider.tools.Ua import ua
from requests import request, RequestException


# Retry up to 3 times, 2 seconds apart, whenever downloader returns None (i.e. the request failed)
@retry(stop_max_attempt_number=3, retry_on_result=lambda x: x is None, wait_fixed=2000)
def downloader(url, method=None, header=None, timeout=None, binary=False, **kwargs):
    logger.info(f'Scraping {url}')
    _header = {'User-Agent': ua()}
    _maxTimeout = timeout if timeout else 5
    _headers = header if header else _header
    _method = "GET" if not method else method
    try:
        response = request(method=_method, url=url, headers=_headers, timeout=_maxTimeout, **kwargs)
        encoding = cchardet.detect(response.content)['encoding']
        if response.status_code == 200:
            return response.content if binary else response.content.decode(encoding)
        elif 200 < response.status_code < 400:
            logger.info(f"Redirect_URL: {response.url}")
        logger.error('Get invalid status code %s while scraping %s', response.status_code, url)
    except RequestException as e:
        logger.error(f'Error occurred while scraping {url}, Msg: {e}', exc_info=True)


if __name__ == '__main__':
    print(downloader("https://www.baidu.com/", "GET"))

At this point, our secondary encapsulation of requests and the construction of a general-purpose request function are complete.

Source address: https://github.com/PowerSpider/PowerSpider/tree/dev
WeChat official account: Accumulated Coder

Look forward to seeing you next time


Origin blog.csdn.net/wzp7081/article/details/109668531