Python ---->> Using the urllib library to obtain network resources

My personal blog homepage: blog homepage
(1) For learning basic Python grammar ----> you can refer to my blog "I learn Python in VScode"
(2) It is necessary to learn about pip in the process of learning Python ---->> see my blog on the importance of pip

The Python urllib library is used to manipulate webpage URLs and crawl the content of webpages.


urllib is a package in the Python standard library that provides a series of modules for working with URLs. It can be used for tasks such as sending HTTP requests [1], handling cookies, redirects, and authentication.

What is a URL?

A URL (Uniform Resource Locator) is a concise representation of the location of a resource available on the Internet and the method for accessing it; it is the standard address of a resource on the Internet. Every file on the Internet has a unique URL, which contains information indicating where the file is located and what the browser should do with it. The URL was originally invented by Tim Berners-Lee as an address for the World Wide Web.

The main modules it includes are:


  • urllib.request (request module): This module defines functions and classes that help open URLs (mostly HTTP) in a complex world - basic and digest authentication, redirections, cookies, and more.

  • urllib.error (exception handling module): This module defines the exception classes raised by urllib.request.

  • urllib.parse (url parsing module): This module defines functions for parsing a URL into its constituent parts, and functions for combining URL components into a complete URL.

  • urllib.robotparser (robots.txt parsing module): This module provides a single class RobotFileParser for answering whether a particular user agent is allowed to fetch URLs on sites that publish robots.txt files.

[1] urllib.request (request module) —> an extensible library for opening URLs

(1) urllib.request.urlopen

The main job of the urllib.request.urlopen() function is to open the specified URL and retrieve the response.

Function prototype:

The urlopen() method supports not only the HTTP protocol but also several other common schemes:

  • HTTPS
  • FTP
  • local files (file:// URLs) and data: URLs
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
- url: the URL to open.
- data: an optional parameter specifying the data to send to the server.
	If this parameter is not provided, the request is a GET request; if it is provided, the request is a POST request.
- timeout: an optional parameter specifying the request timeout in seconds.
	If the server does not respond within this time, a timeout exception is raised.
- cafile: an optional parameter specifying the path to a file containing CA certificates for SSL verification.
- capath: an optional parameter specifying the path to a directory containing CA certificates for SSL verification.
- cadefault: an optional parameter specifying whether to use the system's default CA certificates for SSL verification.
- context: an optional parameter specifying the SSL context to use for the request.

Example:

import urllib.request

# open the target URL (replace with the website you actually want to visit)
response = urllib.request.urlopen('http://www.example.com/')
html = response.read()   # read the response body as a byte string (bytes)
print(html)

The urllib.request.urlopen function opens the URL of the website you want to access.
Then the read method reads the content of the response into the variable html. (read() returns the body of the opened URL as a byte string (bytes).)
Finally, the print function prints the content of html to the console.
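
To illustrate the data and timeout parameters described above, here is a minimal sketch of a POST request; the target URL https://httpbin.org/post is only an illustrative test endpoint and can be replaced with your own.

import urllib.parse
import urllib.request

# Encode the form fields and convert them to bytes; supplying data makes this a POST request.
data = urllib.parse.urlencode({'name': 'python', 'page': 1}).encode('utf-8')

# timeout is in seconds; if the server does not respond in time, an error is raised.
response = urllib.request.urlopen('https://httpbin.org/post', data=data, timeout=10)
print(response.status)                  # HTTP status code, e.g. 200
print(response.read().decode('utf-8'))  # response body decoded from bytes to str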

(2) urllib.request.urlretrieve —> This function retrieves the content of the specified URL and saves it to a local file.

urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None)
  • url: The URL to retrieve.

  • filename: The local filename to save the content to. If not specified, the content is saved to a temporary file with a generated name.

  • reporthook: An optional callback function that is called during retrieval to report progress; it receives the block number, the block size, and the total size.

  • data: Optional data to send to the server, as with urlopen.

Example:

If a filename parameter is passed, and the URL points to a local resource, the result is copied from the local file to the new file.

The urllib.request.urlretrieve function returns a tuple containing the path to the newly created data file and the resulting HTTPMessage object.
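
As a sketch (the URL and filename below are placeholders), urlretrieve can be used like this; show_progress is a hypothetical callback illustrating how reporthook is called:

import urllib.request

def show_progress(block_num, block_size, total_size):
    # reporthook receives the number of blocks transferred so far, the block size,
    # and the total size of the file (-1 if unknown).
    if total_size > 0:
        print(f'downloaded {min(block_num * block_size, total_size)}/{total_size} bytes')

# Returns a tuple: (path of the saved file, HTTPMessage with the response headers).
filename, headers = urllib.request.urlretrieve(
    'http://www.example.com/', 'example.html', reporthook=show_progress)
print(filename)
print(headers.get('Content-Type'))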

[2] urllib.error (exception handling module): This module defines the exception classes raised by urllib.request.

urllib.error.URLError(reason, *args): This is the base class for all exceptions that occur during URL processing. This exception is raised when a URL cannot be opened or retrieved. The reason parameter is a string describing the reason for the error. The args parameter is a tuple containing additional error-related information.
urllib.error.HTTPError(url, code, msg, hdrs, fp): This is a subclass of URLError and is raised when an HTTP error occurs. It contains information about the HTTP error code and its cause. The url parameter is the URL that raised the error, the code parameter is the HTTP error code, the msg parameter is the HTTP error reason phrase, the hdrs parameter is the HTTP response headers, and the fp parameter is a file object for reading the body of the error response.
urllib.error.ContentTooShortError(msg, content): This is a subclass of URLError raised when the downloaded content is shorter than expected. The msg parameter is a string describing the cause of the error. The content parameter is a byte string containing the downloaded content.

# Import the urllib.error module:
import urllib.error
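
A short sketch of how these exception classes are typically used together with urlopen (the URL is just an illustration). Note that HTTPError is caught before URLError, since HTTPError is a subclass of URLError.

import urllib.error
import urllib.request

try:
    response = urllib.request.urlopen('http://www.example.com/does-not-exist', timeout=10)
except urllib.error.HTTPError as e:
    # HTTPError: the server answered, but with an error status code (404, 500, ...).
    print('HTTP error:', e.code, e.reason)
except urllib.error.URLError as e:
    # URLError: the server could not be reached at all (DNS failure, connection refused, ...).
    print('URL error:', e.reason)
else:
    print(response.read()[:100])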

[3] urllib.parse (url parsing module): This module defines functions that parse a URL into its components and functions that combine URL components into a complete URL.

- urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)-> ParseResult: Parses a URL into its components and returns a ParseResult object containing the parsed result.
- urllib.parse.urlunparse(components)-> str: Accepts a tuple of URL components and returns the full URL string.
- urllib.parse.urlencode(query, doseq=False, safe='', encoding=None, errors=None, quote_via=quote_plus)-> str: Accepts a dictionary of query parameters and returns a URL-encoded string.
- urllib.parse.parse_qs(qs, keep_blank_values=False, strict_parsing=False, encoding='utf-8', errors='replace', max_num_fields=None)-> Dict[str, Union[str, List[str]]]: Accepts a URL-encoded query string and returns a dictionary of query parameters.
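
A small sketch of these four functions (the URL and query values are illustrative):

from urllib import parse

# Split a URL into its components.
result = parse.urlparse('https://www.example.com/path/page?q=python&lang=en#top')
print(result.scheme, result.netloc, result.path, result.query, result.fragment)

# Rebuild a complete URL from a 6-tuple (scheme, netloc, path, params, query, fragment).
print(parse.urlunparse(('https', 'www.example.com', '/path/page', '', 'q=python', 'top')))

# Encode a dict of query parameters, and parse a query string back into a dict.
print(parse.urlencode({'q': 'python urllib', 'page': 2}))  # q=python+urllib&page=2
print(parse.parse_qs('q=python&lang=en'))                  # {'q': ['python'], 'lang': ['en']}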

[4] urllib.robotparser (robots.txt parsing module): This module provides a single class RobotFileParser, which is used to answer whether a specific user agent is allowed to obtain URLs on websites that publish robots.txt files.

- urllib.robotparser.RobotFileParser(url=''): Creates a new RobotFileParser object that can parse the robots.txt file at the specified URL.
- RobotFileParser.set_url(url): Sets the URL of the robots.txt file to parse.
- RobotFileParser.read(): Reads the robots.txt file from the set URL and parses its content.
- RobotFileParser.can_fetch(useragent, url): Checks whether the specified user agent is allowed to fetch the specified URL. Returns True if it is; otherwise returns False.
- RobotFileParser.mtime(): Returns the time the robots.txt file was last fetched.
- RobotFileParser.modified(): Sets the time the robots.txt file was last fetched to the current time.
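
A minimal sketch of RobotFileParser, assuming the site at the placeholder URL publishes a robots.txt file:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url('https://www.example.com/robots.txt')  # placeholder robots.txt URL
rp.read()        # fetch and parse the robots.txt file
rp.modified()    # record the time the file was (re)fetched

# Ask whether a given user agent may fetch a given URL according to robots.txt.
print(rp.can_fetch('*', 'https://www.example.com/some/page'))
print(rp.mtime())  # timestamp recorded by modified()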

Using the urllib library to get network resources

Of course, obtaining the network resources we want is the main goal of studying this topic.
You can refer to this blog post: HTML that crawlers want
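
Putting the pieces together, here is a hedged sketch of fetching a page with a custom User-Agent header (the URL is a placeholder for the resource you actually want):

import urllib.error
import urllib.request

url = 'http://www.example.com/'  # replace with the resource you actually want
req = urllib.request.Request(url, headers={'User-Agent': 'Mozilla/5.0'})

try:
    with urllib.request.urlopen(req, timeout=10) as response:
        html = response.read().decode('utf-8')  # decode bytes into text
        print(html[:200])                       # first 200 characters of the page
except urllib.error.URLError as e:
    print('failed to fetch:', e.reason)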


  [1] An HTTP request refers to the process in which a client sends a request to a server and receives a response from the server.


Origin blog.csdn.net/m0_74154295/article/details/130497703