Python - urllib library

        urllib is a built-in Python library for handling network requests.

1. Basic use
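A minimal sketch of basic usage, fetching the Baidu homepage (the URL here is just the example used throughout this article):

import urllib.request

url = 'http://www.baidu.com'

# Send the request; urlopen returns a response object
response = urllib.request.urlopen(url)

# Read the body (bytes) and decode it to a string
content = response.read().decode('utf-8')
print(content)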

2. One type and 6 methods 

        2.1 One type

        The object returned by urllib.request.urlopen is of type <class 'http.client.HTTPResponse'>. Knowing this type helps distinguish it from the responses returned by the third-party requests library introduced later.

        2.2 Six methods

  • read(): reads the whole response body and returns it as bytes (binary data); reading everything in one call can be slow for large responses.
  • readline(): reads a single line of the response body, also returned as bytes.
  • readlines(): reads the response body line by line until the end, returning a list of byte strings.
  • getcode(): gets the HTTP status code.
  • geturl(): returns the requested URL.
  • getheaders(): gets the response headers. (A short usage sketch follows this list.)
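A sketch exercising these methods against the Baidu homepage:

import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')

print(type(response))         # <class 'http.client.HTTPResponse'>
print(response.getcode())     # status code, e.g. 200
print(response.geturl())      # the URL that was requested
print(response.getheaders())  # response headers as (name, value) tuples

# The body can only be consumed once; pick one of read/readline/readlines.
content = response.read()     # the whole body as bytes
print(content.decode('utf-8'))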

3. Download resources

        Use the urllib.request.urlretrieve method

        Some might ask why download resources this way when you can simply click a link in the browser. The answer is scale: when thousands of resources need to be downloaded, manual operation is time-consuming and labor-intensive, while a crawler handles it easily.
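A short sketch of urlretrieve; the URLs and output file names below are placeholders:

import urllib.request

# Download a page to a local file
urllib.request.urlretrieve('http://www.baidu.com', 'baidu.html')

# Download an image in the same way (replace with a real image URL)
# urllib.request.urlretrieve('https://example.com/some_image.jpg', 'image.jpg')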

4. Customization of request objects

  • Discover the problem and its cause

        When the requested URL is https://www.baidu.com, the information returned in the response is incomplete or wrong. This is because the site applies an anti-crawling check on HTTPS requests: it verifies the User-Agent field in the request header, which carries information such as the operating system and version, CPU type, browser and version, rendering engine, browser language, and browser plug-ins.

  •  Solve the problem

First, let’s introduce the urlopen method and Request object:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None) 
Open the URL url, which can be either a string or a Request object.

class urllib.request.Request(url, data=None, headers={},
      origin_req_host=None, unverifiable=False, method=None)

         urlopen accepts either a URL string or a Request object. By constructing a Request object we can customize the headers, so that when urlopen is called the request carries the information we defined.
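A sketch of customizing the request object; the User-Agent string below is only an illustrative value, copy one from your own browser:

import urllib.request

url = 'https://www.baidu.com'

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                  'AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/92.0.4515.159 Safari/537.36'
}

# Build a Request carrying the custom header, then pass it to urlopen
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))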

5. Encoding and decoding

        Computers were originally invented in the United States, so at first only 128 characters were encoded: lowercase letters, uppercase letters, digits, and some symbols. This encoding table is called ASCII.

        ASCII alone cannot represent Chinese, so a separate standard was needed that did not conflict with ASCII. China therefore created the GB2312 encoding, which added Chinese characters to computer encoding.

        But there are many languages in the world. Different languages need different encodings and different standards, and conflicts inevitably occur: the same numeric value maps to different characters in different standards, and the result is garbled characters on screen.

        So Unicode came into being, unifying all languages into one character set and thereby avoiding garbled-text problems.

        example:

        When we search for Jay Chou (周杰伦) on Baidu, the resulting URL is:

url:https://www.baidu.com/s?wd=%E5%91%A8%E6%9D%B0%E4%BC%A6

 The value to the right of the equals sign in wd=%E5%91%A8%E6%9D%B0%E4%BC%A6 is the percent-encoded UTF-8 form of 周杰伦 (Jay Chou).

        The parameters of a GET request are carried in the URL, while the parameters of a POST request are carried in the request body.

        5.1 The quote method for GET requests

        The quote() function lives in the urllib.parse module; it percent-encodes a single parameter value so that it can be placed safely in a URL.

Example: Let’s take the search for Jay Chou as an example:

        Searching directly without encoding the keyword fails, because the request line of a URL can only contain ASCII characters.

        Search using quote encoding:

        The above phenomenon is caused by anti-crawling: Baidu verifies user information, so you also need to add a Cookie to the request header. The cookie can be obtained from the browser's developer tools (right-click the page and choose Inspect):
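A sketch of the quote approach; the Cookie and User-Agent values are placeholders to be replaced with ones copied from your own browser:

import urllib.request
import urllib.parse

# Percent-encode the keyword and splice it into the URL
keyword = urllib.parse.quote('周杰伦')
url = 'https://www.baidu.com/s?wd=' + keyword

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Cookie': 'BIDUPSID=...; BAIDUID=...'   # placeholder, copy from dev tools
}

request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))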

         5.2 The urlencode method for GET requests

        The quote method can only encode one parameter value at a time; when there are several parameters, the code gets clumsy. The urllib.parse module therefore also provides the urlencode method, which percent-encodes multiple parameters at once and joins them into the key=value&key=value format used by GET query strings.

        To make the request, you then only need to splice the encoded parameter string onto the URL.
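A sketch using urlencode with two assumed parameters (wd and sex), spliced onto the base search URL:

import urllib.request
import urllib.parse

base_url = 'https://www.baidu.com/s?'

# urlencode turns the dict into 'wd=%E5%91%A8%E6%9D%B0%E4%BC%A6&sex=%E7%94%B7'
data = {
    'wd': '周杰伦',
    'sex': '男'
}
url = base_url + urllib.parse.urlencode(data)

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))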

        5.3 POST requests

         The difference between POST and GET is that GET parameters are carried in the URL, while POST parameters are carried in the request body.

        When sending a POST request, the parameters must be urlencoded, encoded to bytes, and passed as the data argument when constructing the urllib.request.Request object.

        The parameters can be obtained from the web page:
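A minimal POST sketch; the endpoint and parameter name below (Baidu Translate's suggestion interface with a kw parameter) are assumptions used only for illustration, so adapt them to your own target page:

import urllib.request
import urllib.parse
import json

url = 'https://fanyi.baidu.com/sug'   # assumed endpoint
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

# POST parameters must be urlencoded and then encoded to bytes
data = urllib.parse.urlencode({'kw': 'spider'}).encode('utf-8')

request = urllib.request.Request(url=url, data=data, headers=headers)
response = urllib.request.urlopen(request)
print(json.loads(response.read().decode('utf-8')))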

        5.4 Example: AJAX GET request for the first 10 pages of Douban movies

Introduction to Ajax: Ajax is a term coined by Jesse James Garrett in 2005 to describe a 'new' approach built from a collection of existing technologies, including HTML or XHTML, CSS, JavaScript, the DOM, XML, XSLT, and most importantly XMLHttpRequest. Web applications using Ajax can present incremental updates to the user interface quickly, without reloading (refreshing) the entire page, which lets the application respond to user actions faster.

Request URL for the first page of Douban movies: https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=0&limit=20

Request URL for the second page of Douban movies: https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=20&limit=20

Request URL for the third page of Douban movies: https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=40&limit=20

...

Comparing the URLs for different pages, only the start parameter changes, following the pattern start = (page - 1) * 20.

The code is:
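(A sketch following the URL pattern above; the helper function names and the output file names are my own choices.)

import urllib.request
import urllib.parse


def create_request(page):
    # Build the Request object for a given page number
    base_url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&'
    data = {
        'start': (page - 1) * 20,
        'limit': 20
    }
    url = base_url + urllib.parse.urlencode(data)
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
    return urllib.request.Request(url=url, headers=headers)


def get_content(request):
    # Send the request and return the decoded response body
    response = urllib.request.urlopen(request)
    return response.read().decode('utf-8')


def download(page, content):
    # Save one page of results to a local JSON file
    with open('douban_page_' + str(page) + '.json', 'w', encoding='utf-8') as f:
        f.write(content)


if __name__ == '__main__':
    start_page = 1
    end_page = 10
    for page in range(start_page, end_page + 1):
        request = create_request(page)
        content = get_content(request)
        download(page, content)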

       __name__ is a built-in attribute in Python that holds the name of the current module. When the file is run as the main program, __name__ is set to '__main__'; otherwise it is the corresponding module name.

        5.5 Example: AJAX POST request for KFC stores in a given city

        The parameters of the post request are in the request body.

        Inspecting the request parameters in the browser shows that when requesting different pages, the pageIndex parameter holds the page number.

         Code:
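(A sketch of the POST crawl; the endpoint, the parameter names cname, pid, pageIndex, pageSize, and the city 北京 are assumptions to be verified against the actual page in your browser's dev tools.)

import urllib.request
import urllib.parse

base_url = 'http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=cname'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}


def create_request(page):
    # POST parameters go in the request body, urlencoded and converted to bytes
    data = urllib.parse.urlencode({
        'cname': '北京',
        'pid': '',
        'pageIndex': page,
        'pageSize': '10'
    }).encode('utf-8')
    return urllib.request.Request(url=base_url, data=data, headers=headers)


if __name__ == '__main__':
    for page in range(1, 11):
        response = urllib.request.urlopen(create_request(page))
        content = response.read().decode('utf-8')
        with open('kfc_page_' + str(page) + '.json', 'w', encoding='utf-8') as f:
            f.write(content)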

6. Exceptions

        HTTP errors are the messages returned when the browser cannot reach the server or the request fails; they help users see what is going wrong with the page.

        There are generally two exception classes, URLError and HTTPError, both defined in the urllib.error module.

  • URLError exception class: generally raised when there is a problem with the host or the URL itself, for example a wrong domain name or an unreachable network.

  • HTTPError exception class: generally raised when the path in the URL is wrong and the server responds with an error status code. HTTPError is a subclass of URLError, so it must be caught first (see the sketch after this list).
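A minimal sketch of catching both exceptions; the URL is only illustrative:

import urllib.request
import urllib.error

url = 'https://www.baidu.com/some/nonexistent/path'   # illustrative URL
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}

try:
    request = urllib.request.Request(url=url, headers=headers)
    response = urllib.request.urlopen(request)
    print(response.read().decode('utf-8'))
except urllib.error.HTTPError as e:
    # The server answered with an error status (e.g. 404)
    print('HTTPError:', e.code)
except urllib.error.URLError as e:
    # The host could not be reached at all
    print('URLError:', e.reason)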

 

7. Use cookies to skip login

The phenomenon: when we crawl a Weibo personal homepage, the response cannot be decoded as expected:

 Inspecting what is returned shows the source code of the Weibo login page:

        This is an anti-crawling measure: the requested URL requires login, so the request is redirected to the login page. The login page uses a different character encoding from the page we were trying to access, which is why decoding fails.

        HTTP has a cookie and session mechanism that lets a user log in to a website once and not have to log in again on subsequent visits.

        Find the cookie you need to access the web page and set it in the header to skip logging in.

The Referer field in the header is another anti-crawling check: the server verifies that the page you came from matches the expected URL and rejects the request otherwise. It is commonly used to protect images against hotlinking.
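A sketch of crawling with a login cookie; the personal homepage URL, Cookie, and Referer values below are placeholders to be copied from a logged-in browser session via the developer tools:

import urllib.request

url = 'https://weibo.cn/xxxxxxxx/info'        # placeholder personal homepage
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
    'Cookie': 'SUB=...; SUBP=...',            # placeholder cookie from a logged-in session
    'Referer': 'https://weibo.cn/'            # previous page expected by the server
}

request = urllib.request.Request(url=url, headers=headers)
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')

with open('weibo.html', 'w', encoding='utf-8') as f:
    f.write(content)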

8. Handler processors

        Handlers are used to customize requests at a more advanced level than a plain Request object, mainly to handle dynamic cookies and proxies.

         8.1 Basic use of handler
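A minimal sketch of the handler / opener pattern: create a handler, build an opener from it, then use the opener to send the request:

import urllib.request

url = 'http://www.baidu.com'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
request = urllib.request.Request(url=url, headers=headers)

handler = urllib.request.HTTPHandler()          # 1. create a handler
opener = urllib.request.build_opener(handler)   # 2. build an opener from it
response = opener.open(request)                 # 3. send the request with the opener

print(response.read().decode('utf-8'))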

        8.2 Proxy
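A sketch of sending a request through a proxy with ProxyHandler; the proxy address is a placeholder to be replaced with a working proxy of your own:

import urllib.request

url = 'http://www.baidu.com/s?wd=ip'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
request = urllib.request.Request(url=url, headers=headers)

proxies = {'http': 'http://127.0.0.1:7890'}     # placeholder proxy address

handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)

print(response.read().decode('utf-8'))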

         8.3 Proxy pool
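A sketch of a simple proxy pool: keep a list of proxies and pick one at random for each request. Every address below is a placeholder:

import random
import urllib.request

proxy_pool = [
    {'http': 'http://127.0.0.1:7890'},   # placeholder proxies
    {'http': 'http://127.0.0.1:7891'},
]

# Pick a proxy at random to spread requests across the pool
proxies = random.choice(proxy_pool)

url = 'http://www.baidu.com/s?wd=ip'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'}
request = urllib.request.Request(url=url, headers=headers)

handler = urllib.request.ProxyHandler(proxies=proxies)
opener = urllib.request.build_opener(handler)
response = opener.open(request)
print(response.read().decode('utf-8'))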
