Using the urllib library in crawlers

Before learning about crawlers, we should first understand the urllib library. It is Python's built-in HTTP request library, so it can be used directly without any extra installation. It consists of the following four modules:

request: the most basic HTTP request module, used to simulate sending a request. You only need to pass a URL and some additional parameters to its methods to simulate the request process.

error: the exception handling module; if a request error occurs, we can catch the exception and then handle it, for example by retrying.

parse: A tool module that provides many URL processing methods, such as splitting, parsing, merging, etc.

robotparser: mainly used to parse a website's robots.txt file and determine which pages can be crawled and which cannot. In practice it is rarely used.

How to crawl the web?

from urllib.request import urlopen

response = urlopen("http://www.baidu.com")
print(response.read().decode())

Common methods

request.urlopen(url, data, timeout)

The first parameter, url, is the address to request; the second, data, is the data to send when accessing the URL; the third, timeout, sets the timeout period.
The second and third parameters are optional: data defaults to None and timeout defaults to socket._GLOBAL_DEFAULT_TIMEOUT.
The first parameter, url, is required. In this example we pass Baidu's URL. After the urlopen method executes, a response object is returned, with the returned information stored in it.
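
For example, a minimal sketch showing the optional timeout parameter (the 3-second value is just an illustration):

import socket
from urllib import error
from urllib.request import urlopen

try:
    # data is omitted (defaults to None); timeout is set to 3 seconds
    response = urlopen("http://www.baidu.com", timeout=3)
    print(response.read().decode())
except (error.URLError, socket.timeout) as e:
    print("request failed or timed out:", e)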

response.read()

The read() method reads the entire content of the response and returns it as bytes.

response.getcode()

Returns the HTTP response code: 200 means success, codes starting with 4 indicate a client/page error, and codes starting with 5 indicate a server problem.

response.geturl()

Returns the URL from which the data was actually fetched, which guards against problems caused by redirects.

response.info()

Returns the HTTP headers of the server response
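
Putting these together, a small sketch that exercises each of these methods on the response object:

from urllib.request import urlopen

response = urlopen("http://www.baidu.com")
print(response.getcode())              # HTTP status code, e.g. 200
print(response.geturl())               # the URL the data was actually fetched from
print(response.info())                 # the HTTP headers of the response
print(response.read().decode()[:200])  # first 200 characters of the decoded body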

Request object

The request module is mainly responsible for constructing and initiating network requests, and for adding Headers, a Proxy, and so on to them.
It can be used to simulate a browser's request process.

In fact, the urlopen parameter above can also be a request, i.e. an instance of the Request class, which is constructed with the url, the data, and so on. For example, the code above can be rewritten like this:

from urllib.request import urlopen
from urllib.request import Request

request = Request("http://www.baidu.com")
response = urlopen(request)
print(response.read().decode())

The result is exactly the same; the only difference is the extra Request object in the middle. Writing it this way is recommended, because a real request usually needs a lot of extra information added while it is being constructed, and building a Request that the server then answers with a response keeps the logic clear.

Requests sent through urllib carry a default header, "User-Agent": "Python-urllib/3.6", which indicates that the request was sent by urllib. So when we encounter a website that checks the User-Agent, we need to customize the headers to disguise ourselves.

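A minimal sketch of adding a custom User-Agent through the Request object (the header string below is just an example browser identity):

from urllib.request import Request, urlopen

url = "http://www.baidu.com"
headers = {
    # Pretend to be an ordinary browser instead of Python-urllib
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
}
request = Request(url, headers=headers)
response = urlopen(request)
print(response.getcode())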

When running a crawler, it often happens that the IP gets blocked. In that case we need to use an IP proxy; the proxy settings for urllib are as follows:

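A minimal sketch using ProxyHandler; the proxy address 127.0.0.1:8888 is only a placeholder and must be replaced with a real proxy:

from urllib.request import ProxyHandler, build_opener

# Build an opener that routes HTTP traffic through the (placeholder) proxy
proxy_handler = ProxyHandler({"http": "http://127.0.0.1:8888"})
opener = build_opener(proxy_handler)
response = opener.open("http://www.baidu.com")
print(response.read().decode())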

In the process of developing crawlers, handling cookies is very important. Cookies are handled in urllib as follows:

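A minimal sketch using http.cookiejar together with HTTPCookieProcessor to store and reuse cookies:

from http import cookiejar
from urllib.request import HTTPCookieProcessor, build_opener

# The CookieJar keeps cookies returned by the server and sends them back on later requests
cookie_jar = cookiejar.CookieJar()
opener = build_opener(HTTPCookieProcessor(cookie_jar))
response = opener.open("http://www.baidu.com")
for cookie in cookie_jar:
    print(cookie.name, "=", cookie.value)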

GET request

Most of the HTML, images, JS, CSS, and so on delivered to the browser are requested with the GET method; it is the main way to fetch data.

For example: searching on www.baidu.com.

The parameters of a GET request all appear in the URL. If they contain Chinese characters, they need to be transcoded, and for that we can use:

urllib.parse.urlencode()
urllib.parse.quote()
parse.urlencode()

When sending a request, it is often necessary to pass many parameters. Splicing them together by hand as strings is troublesome, so the parse.urlencode() method is used to build the URL's query string.

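A minimal sketch of building a query string with parse.urlencode() (the parameters are just an example):

from urllib import parse

params = {"wd": "python", "pn": 20}
query_string = parse.urlencode(params)
url = "https://www.baidu.com/s?" + query_string
print(url)
# https://www.baidu.com/s?wd=python&pn=20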

It can also be converted back to a dictionary via the parse.parse_qs() method.

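A minimal sketch of going back the other way with parse.parse_qs():

from urllib import parse

query_string = "wd=python&pn=20"
print(parse.parse_qs(query_string))
# {'wd': ['python'], 'pn': ['20']}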

parse.quote()

A URL can only contain ASCII characters. In practice, the parameters passed via the URL in a GET request often contain special characters such as Chinese, so they must be URL-encoded.

For example https://www.baidu.com/s?wd=%E6%AF%9B%E5%88%A9

from urllib import parse

url = 'https://www.baidu.com/s?wd={}'
save_url = url.format(parse.quote('毛利'))
print(save_url)
url = parse.unquote(save_url)
print(url)

https://www.baidu.com/s?wd=%E6%AF%9B%E5%88%A9
https://www.baidu.com/s?wd=毛利

In short, any non-ASCII parameter must be URL-encoded before it is put into the URL.

POST request

We said earlier that the Request object has a data parameter, which is used for POST. The data we want to transmit goes into this parameter: it starts as a dictionary of key-value pairs, which must then be url-encoded and converted to bytes before it is sent.
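
A minimal sketch of a POST request; https://httpbin.org/post is a public echo endpoint used here only for illustration, and the form fields are made up:

from urllib import parse
from urllib.request import Request, urlopen

# Build the form data from a dictionary, url-encode it, then convert it to bytes
form = {"username": "test", "password": "123456"}
data = parse.urlencode(form).encode("utf-8")

request = Request("https://httpbin.org/post", data=data)
response = urlopen(request)
print(response.read().decode())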

Each request and response header has its own meaning: for example, the request headers User-Agent, Referer, and Cookie identify the client, the referring page, and the session state, while the response headers Content-Type and Set-Cookie describe the type and encoding of the body and ask the client to store cookies.

The encoding of the response can usually be read from the charset in the Content-Type response header, which is available via response.info().

Response status code

The response status code consists of three digits, the first digit defines the category of the response, and there are five possible values.

Common status codes: 200 (OK), 301 and 302 (redirects), 403 (Forbidden), 404 (Not Found), and 500 (Internal Server Error).

Ajax request to get data

Some webpage content is loaded via AJAX, and AJAX generally returns JSON. Sending a GET or POST request directly to the AJAX address will return the JSON data.
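
A minimal sketch, assuming a hypothetical AJAX endpoint that returns JSON (the URL below is only a placeholder):

import json
from urllib.request import Request, urlopen

# Placeholder URL; substitute the real AJAX address found in the browser's network panel
url = "https://example.com/api/list?page=1"
request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
response = urlopen(request)
data = json.loads(response.read().decode())
print(data)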

Request SSL Certificate Verification

Websites starting with https can be seen everywhere now. urllib can request and verify SSL certificates for HTTPS just like a web browser: if the website's SSL certificate is issued by a CA, it can be accessed normally, for example https://www.baidu.com/.

import ssl
import urllib.request

# Ignore SSL certificate verification
context = ssl._create_unverified_context()
# Pass it in through the context parameter
response = urllib.request.urlopen(request, context=context)

Summary

Personally, I do not recommend using the urllib library; you only need to understand parse.urlencode() and parse.quote().


Origin blog.csdn.net/weixin_44617651/article/details/129612681