urllib: the basic library for Python web crawlers

In Python 3, the two Python 2 libraries urllib and urllib2 were merged into a single, unified urllib library.

The urllib library is Python's built-in HTTP request library. It comprises the following four main modules:

  • request: the basic and most important module, used to simulate sending HTTP requests.
  • error: the exception handling module.
  • parse: a utility module offering many URL processing methods: splitting, parsing, joining, and so on.
  • robotparser: used to parse a site's robots.txt file and determine which parts of the site may be crawled.

1.request module: sending a request

The request module can send requests and obtain responses. It mainly offers the following methods:

1.urlopen()

The basic urlopen() method can accomplish a simple GET request for a page. Take Baidu as an example:

import urllib.request
response = urllib.request.urlopen("http://www.baidu.com")
print(response.read().decode("utf-8"))

Running the above code prints the HTML source of the Baidu homepage (output omitted here).

So what exactly is the response variable we get back? Printing its type shows that it is an object of class <class 'http.client.HTTPResponse'>. Its main attributes are msg, version, status, reason, and so on; its main methods are read(), readinto(), getheader(name), getheaders(), and the like. Once we have this object, we can call these methods and access these attributes.

For example, with the Baidu request above we can call a few of them.

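Here is a minimal sketch of these calls (the exact values printed depend on the server's response at the time):

import urllib.request

response = urllib.request.urlopen("http://www.baidu.com")

print(type(response))                 # <class 'http.client.HTTPResponse'>
print(response.status)                # HTTP status code, e.g. 200
print(response.getheaders())          # all response headers as (name, value) pairs
print(response.getheader("Server"))   # the value of a single header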

What if you want to pass some parameters along with the link? First, look at how the documentation defines the urlopen() function:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

As you can see, there are several parameters besides url. Let's introduce them one by one.

  • data parameter

An optional parameter. If you want to add it, the content must be converted into a byte-stream format with the bytes() method. Adding this parameter also changes the request method from GET to POST. For example: data = bytes(urllib.parse.urlencode({'world': 'hello'}), encoding='utf-8'). A concrete request is sketched after this list.

  • timeout parameter

Sets the timeout in seconds. For example, with timeout=1, if the request has not received a response within that time, an exception is raised.
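To make the data parameter concrete, here is a minimal sketch that sends a POST request; it uses httpbin.org, a public request-echo service, purely as an illustrative target:

import urllib.parse
import urllib.request

# Encode the parameters as a UTF-8 byte stream; passing data turns the request into a POST
data = bytes(urllib.parse.urlencode({"world": "hello"}), encoding="utf-8")
response = urllib.request.urlopen("http://httpbin.org/post", data=data)
print(response.read().decode("utf-8"))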

2.Request()

urlopen() is the library's most basic way to send a request and can fetch the simplest content, but it cannot handle more complex requests, such as adding headers. For that we need to build the more powerful Request object. Look at how Request is used:

import urllib.request

request = urllib.request.Request("https://www.baidu.com")
response = urllib.request.urlopen(request)
print(response.read().decode("utf-8"))

In the code above, the request is still sent with the same urlopen() method, but the argument is no longer a URL; it is an object of the Request class. With Request we can configure the parameters more flexibly and conveniently.
Request's API looks like this:

urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

  • url: required parameter; the URL to request
  • data: must be of type bytes (a byte stream)
  • headers: a dictionary, generally used to disguise the crawler as a browser; without this parameter the server can recognize that you are visiting with Python, judge you to be a crawler, and blacklist you as an anti-crawling measure (see the sketch after this list)
  • origin_req_host: the host name or IP of the requesting party
  • unverifiable: indicates whether the request is unverifiable; defaults to False. It means the user does not have sufficient permission to receive the content of the request
  • method: the request method to use, such as GET, POST, or PUT
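Putting these parameters together, here is a sketch that builds a POST Request with a browser-like User-Agent header (httpbin.org is again used only as an illustrative target, and the header string is just an example):

from urllib import parse, request

url = "http://httpbin.org/post"
headers = {
    # Disguise the crawler as a regular browser
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
}
data = bytes(parse.urlencode({"name": "Merry"}), encoding="utf-8")

req = request.Request(url=url, data=data, headers=headers, method="POST")
response = request.urlopen(req)
print(response.read().decode("utf-8"))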

2.error module: handling exceptions

The urllib.error module defines the exceptions produced by the request module. If something goes wrong, the request module raises one of these exceptions.

1.URLError

Comes from the urllib.error module and inherits from the OSError class. It is the base class of the error module's exceptions, so exceptions raised by the request module can all be handled through it. It has one attribute, reason, which gives the reason for the error.

2.HTTPError

A subclass of URLError, specialized in handling HTTP request errors, such as failed requests. It has three attributes:

  • code: the HTTP status code returned, such as the common 404 (page not found)
  • reason: the reason for the error
  • headers: the headers of the response

Here is a common way to write the exception-handling code:

from urllib import request, error

try:
    response = request.urlopen("http://www.aaaa.com")
except error.HTTPError as e:
    print(e.reason, e.code, e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print("Request Successfully")

Because HTTPError is a subclass of URLError, we first catch HTTPError and obtain its status; if the exception is not an HTTPError, we then catch URLError and print the reason. Finally, the else branch handles the logic for the normal case.
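As a concrete case, a timeout also surfaces as a URLError whose reason is a socket.timeout. A sketch, assuming the deliberately tiny timeout fires before the server can answer:

import socket
from urllib import request, error

try:
    response = request.urlopen("http://httpbin.org/get", timeout=0.01)
except error.URLError as e:
    # The underlying cause is stored in e.reason
    if isinstance(e.reason, socket.timeout):
        print("TIME OUT")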

3.parse module: parsing links

The parse module defines standard interfaces for handling URLs, for extracting, combining, and converting the various parts of a URL.

1.urlparse()

Look at an example by running the following code:

from urllib.parse import urlparse

result = urlparse("https://voice.baidu.com/act/newpneumonia/newpneumonia/?from=osari_pc_1")
print(type(result),"\n", result)

The result is as follows:

<class 'urllib.parse.ParseResult'>
ParseResult(scheme='https', netloc='voice.baidu.com', path='/act/newpneumonia/newpneumonia/', params='', query='from=osari_pc_1', fragment='')

As you can see, the result is a ParseResult object comprising six parts: scheme, netloc, path, params, query, and fragment. Comparing it with the original link, we can see that a standard URL has the following format:

scheme://netloc/path;params?query#fragment

Here, what comes before "//" is the scheme, representing the protocol; what comes before the first "/" is the netloc, i.e. the domain name, and after it comes the path, i.e. the access path; what follows ";" is params, representing parameters; what follows "?" is the query, generally used with GET-type URLs; what follows "#" is the fragment, an anchor used to jump directly to a position inside the page. URLs generally conform to this rule, and urlparse() can split them apart accordingly.

urllib.parse.ParseResult is actually a tuple type, so it can also be accessed by index like a list.
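For example, attribute access and index access are interchangeable:

from urllib.parse import urlparse

result = urlparse("https://voice.baidu.com/act/newpneumonia/newpneumonia/?from=osari_pc_1")
print(result.scheme, result[0])   # https https -- same value either way
print(result.netloc, result[1])   # voice.baidu.com voice.baidu.com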

2.urlunparse()

The inverse operation of urlparse().
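It takes an iterable of exactly six elements and assembles them back into a URL. A quick sketch:

from urllib.parse import urlunparse

# scheme, netloc, path, params, query, fragment -- exactly six parts
data = ["https", "www.baidu.com", "index.html", "user", "a=6", "comment"]
print(urlunparse(data))
# https://www.baidu.com/index.html;user?a=6#comment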

3.urlsplit()

Similar to urlparse(), except that it does not parse params as a separate part; params is merged into path, so the result has a length of 5.
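A quick comparison with urlparse():

from urllib.parse import urlsplit

result = urlsplit("https://www.baidu.com/index.html;user?a=6#comment")
print(result)
# SplitResult(scheme='https', netloc='www.baidu.com', path='/index.html;user', query='a=6', fragment='comment')
print(len(result))   # 5 -- params stays inside path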

4.urljoin()

urljoin(base_url, new_url)

Splices the two URLs base_url and new_url together. The principle is that only three elements of base_url matter: scheme, netloc, and path. If any of these three is missing from the new link, base_url automatically fills it in; if it is present, the corresponding content of new_url is used. The params, query, and fragment of base_url play no role at all.
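A few examples make the rule concrete (example.com is used as an arbitrary second domain):

from urllib.parse import urljoin

# new_url lacks scheme and netloc, so base_url supplies them
print(urljoin("https://www.baidu.com", "FAQ.html"))
# https://www.baidu.com/FAQ.html

# new_url is already a complete URL, so base_url is ignored
print(urljoin("https://www.baidu.com", "https://example.com/FAQ.html"))
# https://example.com/FAQ.html

# the query part of base_url plays no role
print(urljoin("https://www.baidu.com?wd=abc", "https://example.com/index.php"))
# https://example.com/index.php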

5.urlencode()

Let's go straight to the usage:

from urllib.parse import urlencode

params = {
    'name':'Merry',
    'age':18
}
base_url = "https://www.baidu.com?"
url = base_url + urlencode(params)
print(url)

The output is: https://www.baidu.com?name=Merry&age=18. The urlencode() function serializes its first parameter, a dictionary, into GET request parameters.

6.quote()

from urllib.parse import quote

key = "你好"
url = "www.baidu.com/s?wd=" + quote(key)
print(url)

www.baidu.com/s?wd=%E4%BD%A0%E5%A5%BD

This function converts content into the URL-encoded format, which solves the garbling problem that Chinese characters in URL parameters might otherwise cause. Its inverse operation is unquote().
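A quick sketch of the inverse:

from urllib.parse import unquote

url = "www.baidu.com/s?wd=%E4%BD%A0%E5%A5%BD"
print(unquote(url))
# www.baidu.com/s?wd=你好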

In addition, there are functions such as urlunsplit(), parse_qs(), and parse_qsl(), which are not described in detail here.
