urllib.request --- Extensible library for opening URLs

Source code: Lib/urllib/request.py


The urllib.request module defines functions and classes which help in opening URLs (mostly HTTP) in a complex world: basic and digest authentication, redirections, cookies and more.

See also

The Requests package is recommended for a higher-level HTTP client interface.

Availability: Not Emscripten, not WASI.

This module does not work or is not available on the WebAssembly platforms wasm32-emscripten and wasm32-wasi. See WebAssembly platforms for more information.

The urllib.request module defines the following functions:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

Open url, which can be either a string containing a valid, properly encoded URL, or a Request object.

data must be an object giving additional data to be sent to the server, or None if no data needs to be sent. Please see Request for details.

The urllib.request module uses HTTP/1.1 and includes a Connection:close header in its HTTP requests.

The optional timeout parameter specifies a timeout in seconds for blocking operations such as the connection attempt (if not specified, the global default timeout setting is used). It actually only works for HTTP, HTTPS and FTP connections.

If context is specified, it must be an ssl.SSLContext instance describing the various SSL options. See HTTPSConnection for more details.

The optional cafile and capath parameters specify a set of trusted CA certificates for HTTPS requests. cafile should point to a single file containing a bundle of CA certificates, whereas capath should point to a directory of hashed certificate files. For more information, see ssl.SSLContext.load_verify_locations().

The cadefault parameter is ignored.

This function always returns an object which can work as a context manager and has the properties url, headers and status. For more details about these properties, see urllib.response.addinfourl.

For HTTP and HTTPS URLs, this function returns a slightly modified http.client.HTTPResponse object. In addition to the three properties above, the msg attribute contains the same information as the reason attribute (the reason phrase returned by the server) instead of the response headers as specified in the documentation for HTTPResponse.

For FTP, file and data URLs, and for requests explicitly handled by the legacy URLopener and FancyURLopener classes, this function returns a urllib.response.addinfourl object.
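As a minimal offline sketch of the return value described above, the following opens a data: URL, chosen here only because it needs no network access, and reads the three properties. The URL content is an illustrative assumption.

```python
import urllib.request

# A data: URL is decoded locally, so this runs without network access.
url = "data:text/plain;charset=utf-8,hello%20world"

# The returned object works as a context manager.
with urllib.request.urlopen(url) as response:
    body = response.read()
    final_url = response.url
    content_type = response.headers["Content-Type"]
    status = response.status  # None here: data URLs carry no HTTP status

print(body, content_type, status)
```

For an HTTP or HTTPS URL the same pattern applies, and status then holds the numeric response code.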

When an error occurs in the protocol, URLError will be raised.

Note that None may be returned if no handler handles the request (though the default installed global OpenerDirector uses UnknownHandler to ensure this never happens).

In addition, if proxy settings are detected (for example, when a *_proxy environment variable such as http_proxy is set), ProxyHandler is installed by default and makes sure the requests are handled through the proxy.

The legacy urllib.urlopen function from Python 2.6 and earlier has been discontinued; urllib.request.urlopen() corresponds to the old urllib2.urlopen. Proxy handling, which was done by passing a dictionary parameter to urllib.urlopen, can be obtained by using ProxyHandler objects.

The default opener raises an auditing event urllib.Request with arguments fullurl, data, headers and method taken from the request object.

Changed in version 3.2: cafile and capath were added.

Changed in version 3.2: HTTPS virtual hosts are now supported if possible (that is, if ssl.HAS_SNI is true).

New in version 3.2: data can be an iterable object.

Changed in version 3.3: cadefault was added.

Changed in version 3.4.3: context was added.

Changed in version 3.10: When context is not given, HTTPS connections now send an ALPN extension with the protocol indicator http/1.1. A custom context should set ALPN protocols with set_alpn_protocols().

Deprecated since version 3.6: cafile, capath and cadefault are deprecated in favor of context. Please use ssl.SSLContext.load_cert_chain() instead, or let ssl.create_default_context() select the system's trusted CA certificates for you.

urllib.request.install_opener(opener)

Install an OpenerDirector instance as the default global opener. Installing an opener is only necessary if you want urlopen to use that opener; otherwise, simply call OpenerDirector.open() instead of urlopen(). The code does not check for a real OpenerDirector, and any class with the appropriate interface will work.

urllib.request.build_opener([handler, ...])

Return an OpenerDirector instance, which chains the handlers in the order given. Each handler can be either an instance of BaseHandler or a subclass of BaseHandler (in which case it must be possible to call the constructor without any parameters). Instances of the following classes will be in front of the handlers, unless the handlers contain them, instances of them or subclasses of them: ProxyHandler (if proxy settings are detected), UnknownHandler, HTTPHandler, HTTPDefaultErrorHandler, HTTPRedirectHandler, FTPHandler, FileHandler, HTTPErrorProcessor.

If the Python installation has SSL support (that is, if the ssl module can be imported), HTTPSHandler will also be added.

A BaseHandler subclass may also change its handler_order attribute to modify its position in the handlers list.
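The two ways of using an opener can be sketched as follows. This is only an illustration: the data: URLs are placeholders that avoid network access, and passing ProxyHandler({}) is one possible choice of handler.

```python
import urllib.request

# ProxyHandler({}) takes the place of the auto-detected ProxyHandler;
# an empty dictionary means "use no proxies".
opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

# Option 1: use the opener directly, leaving the global default untouched.
with opener.open("data:text/plain,via%20opener") as response:
    direct_body = response.read()

# Option 2: install it, so that plain urlopen() calls go through it.
urllib.request.install_opener(opener)
with urllib.request.urlopen("data:text/plain,via%20urlopen") as response:
    global_body = response.read()

print(direct_body, global_body)
```

Installing an opener changes process-wide state, so calling opener.open() directly is often the less surprising choice in library code.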

urllib.request.pathname2url(path)

Convert the pathname path from the local syntax for a path to the form used in the path component of a URL. This does not produce a complete URL. The return value will already be quoted using the quote() function.

urllib.request.url2pathname(path)

Convert the path component path from a percent-encoded URL to the local syntax for a path. This does not accept a complete URL. This function uses unquote() to decode path.
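A small round-trip sketch of the two functions follows. The relative path is a made-up example; the exact URL form depends on the platform and Python version, so no output is shown.

```python
import os
from urllib.request import pathname2url, url2pathname

# Build the path with os.path.join so the example uses the local separator.
local_path = os.path.join("docs", "my file.txt")

url_path = pathname2url(local_path)   # the space is percent-encoded
round_trip = url2pathname(url_path)   # decoded back to local syntax

print(url_path)
print(round_trip)
```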

urllib.request.getproxies()

This helper function returns a dictionary of scheme to proxy server URL mappings. It first scans the environment for variables named <scheme>_proxy, case-insensitively, on all operating systems. If it finds none, it looks for proxy information in System Configuration on macOS and in the Windows Systems Registry on Windows. If both lowercase and uppercase environment variables exist (and disagree), lowercase is preferred.

Remark

If the environment variable REQUEST_METHOD is set, which usually indicates that your script is running in a CGI environment, the environment variable HTTP_PROXY (uppercase _PROXY) will be ignored. This is because that variable can be injected by a client using the "Proxy:" HTTP header. If you need to use an HTTP proxy in a CGI environment, either use ProxyHandler explicitly, or make sure the variable name is in lowercase (or at least the _proxy suffix).
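The environment lookup can be tried out with a short sketch like the one below. The proxy address is a placeholder, and the variable is restored afterwards so the example leaves the process environment as it found it.

```python
import os
import urllib.request

saved = os.environ.get("http_proxy")
try:
    # Hypothetical proxy address, set only for the duration of the call.
    os.environ["http_proxy"] = "http://proxy.example:3128"
    proxies = urllib.request.getproxies()
finally:
    # Restore the previous state of the variable.
    if saved is None:
        os.environ.pop("http_proxy", None)
    else:
        os.environ["http_proxy"] = saved

print(proxies.get("http"))
```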

The following classes are provided:

class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None)

This class is an abstraction of a URL request.

url should be a string containing a valid, properly encoded URL.

data must be an object specifying additional data to send to the server, or None if no such data is needed. Currently HTTP requests are the only ones that use data. The supported object types include bytes, file-like objects, and iterables of bytes-like objects. If neither a Content-Length nor a Transfer-Encoding header field has been provided, HTTPHandler will set these headers according to the type of data. Content-Length will be used to send bytes objects, while Transfer-Encoding: chunked, as specified in RFC 7230, Section 3.3.1, will be used to send files and other iterables.

For an HTTP POST request method, data should be a buffer in the standard application/x-www-form-urlencoded format. The urllib.parse.urlencode() function takes a mapping or a sequence of 2-tuples and returns an ASCII string in this format. It should be encoded to bytes before being used as the data parameter.

headers should be a dictionary, and will be treated as if add_header() was called with each key and value as arguments. This is often used to "spoof" the User-Agent header value, which is used by a browser to identify itself: some HTTP servers only allow requests coming from common browsers as opposed to scripts. For example, Mozilla Firefox may identify itself as "Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11", while urllib's default user agent string is "Python-urllib/2.6" (on Python 2.6). All header keys are sent in camel case.

An appropriate Content-Type header should be included if the data argument is present. If this header has not been provided and data is not None, Content-Type: application/x-www-form-urlencoded will be added as a default.

The next two parameters are only useful for processing third-party HTTP cookies:

origin_req_host should be the request-host of the origin transaction, as defined by RFC 2965. It defaults to http.cookiejar.request_host(self). This is the host name or IP address of the original request that was initiated by the user. For example, if the request is for an image in an HTML document, this should be the request-host of the request for the page containing the image.

unverifiable should indicate whether the request is unverifiable, as defined by RFC 2965. It defaults to False. An unverifiable request is one whose URL the user did not have the option to approve. For example, if the request is for an image in an HTML document, and the user had no option to approve the automatic fetching of the image, this should be true.

method should be a string that indicates the HTTP request method that will be used (e.g. 'HEAD'). If provided, its value is stored in the method attribute and is used by get_method(). The default is 'GET' if data is None or 'POST' otherwise. Subclasses may indicate a different default method by setting the method attribute in the class itself.
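The constructor arguments described above can be combined as in this sketch. The URL, form fields and User-Agent value are illustrative assumptions; nothing is sent over the network because the requests are only constructed, not opened.

```python
from urllib.parse import urlencode
from urllib.request import Request

# urlencode() returns a str; it must be encoded to bytes before use as data.
form = urlencode({"q": "python", "page": 2}).encode("ascii")

post_req = Request(
    "http://www.example.com/search",
    data=form,
    headers={"User-Agent": "example-client/1.0"},
)
head_req = Request("http://www.example.com/search", method="HEAD")

print(post_req.get_method())  # POST, because data is not None
print(head_req.get_method())  # HEAD, because method was given explicitly
print(post_req.data)
```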

Remark

The request will not work as expected if the data object is unable to deliver its content more than once (e.g. a file or an iterable that can produce the content only once) and the request is retried for HTTP redirects or authentication. The data is sent to the HTTP server right away after the headers. There is no support for a 100-continue expectation in the library.

Changed in version 3.3: The Request.method argument was added to the Request class.

Changed in version 3.4: The default Request.method may be indicated at the class level.

Changed in version 3.6: No error is raised if Content-Length has not been provided and data is neither None nor a bytes object. Chunked transfer encoding is used as a fallback instead.

class urllib.request.OpenerDirector

The OpenerDirector class opens URLs via BaseHandlers chained together. It manages the chaining of handlers and recovery from errors.

class urllib.request.BaseHandler

This is the base class for all registered handlers and only provides a simple registration mechanism.

class urllib.request.HTTPDefaultErrorHandler

A class which defines a default handler for HTTP error responses; all responses are turned into HTTPError exceptions.

class urllib.request.HTTPRedirectHandler

A class for handling redirects.

class urllib.request.HTTPCookieProcessor(cookiejar=None)

A class for handling HTTP Cookies.

class urllib.request.ProxyHandler(proxies=None)

Cause requests to go through a proxy. If proxies is given, it must be a dictionary mapping protocol names to URLs of proxies. The default is to read the list of proxies from the environment variables <protocol>_proxy. If no proxy environment variables are set, then in a Windows environment proxy settings are obtained from the registry's Internet Settings section, and in a macOS environment proxy information is retrieved from the System Configuration Framework.

To disable an automatically detected proxy, pass an empty dictionary.

The no_proxy environment variable can be used to specify hosts which should not be reached via a proxy; if set, it should be a comma-separated list of hostname suffixes, optionally with :port appended, for example cern.ch,ncsa.uiuc.edu,some.host:8080.

Remark

HTTP_PROXY will be ignored if the variable REQUEST_METHOD is set; see the documentation on getproxies().

class urllib.request.HTTPPasswordMgr

Keep a database of (realm, uri) -> (user, password) mappings.

class urllib.request.HTTPPasswordMgrWithDefaultRealm

Keep a database of (realm, uri) -> (user, password) mappings. A realm of None is considered a catch-all realm, which is searched if no other realm fits.

class urllib.request.HTTPPasswordMgrWithPriorAuth

A variant of HTTPPasswordMgrWithDefaultRealm that also has a database of uri -> is_authenticated mappings. It can be used by a BasicAuth handler to determine when to send authentication credentials immediately instead of waiting for a 401 response first.

New in version 3.5.

class urllib.request.AbstractBasicAuthHandler(password_mgr=None)

This is a mixin class that helps with HTTP authentication, both to the remote host and to a proxy. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to the section on HTTPPasswordMgr objects for information on the interface that must be supported. If password_mgr also provides is_authenticated and update_authenticated methods (see the section on HTTPPasswordMgrWithPriorAuth objects), then the handler will use the is_authenticated result for a given URI to determine whether or not to send authentication credentials with the request. If is_authenticated returns True for the URI, credentials are sent. If is_authenticated is False, credentials are not sent, and then if a 401 response is received the request is re-sent with the authentication credentials. If authentication succeeds, update_authenticated is called to set is_authenticated to True for the URI, so that subsequent requests to the URI or any of its super-URIs will automatically include the authentication credentials.

New in version 3.5: Added is_authenticated support.

class urllib.request.HTTPBasicAuthHandler(password_mgr=None)

Handle authentication with the remote host. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to the section on HTTPPasswordMgr objects for information on the interface that must be supported. HTTPBasicAuthHandler will raise a ValueError when presented with a wrong authentication scheme.

class urllib.request.ProxyBasicAuthHandler(password_mgr=None)

Handle authentication with the proxy. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to the section on HTTPPasswordMgr objects for information on the interface that must be supported.

class urllib.request.AbstractDigestAuthHandler(password_mgr=None)

This is a mixin class that helps with HTTP authentication, both to the remote host and to a proxy. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to the section on HTTPPasswordMgr objects for information on the interface that must be supported.

class urllib.request.HTTPDigestAuthHandler(password_mgr=None)

Handle authentication with the remote host. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to the section on HTTPPasswordMgr objects for information on the interface that must be supported. When both a digest authentication handler and a basic authentication handler are added, digest authentication is always tried first. If digest authentication returns a 40x response again, the response is passed to the basic authentication handler. This handler method will raise a ValueError when presented with an authentication scheme other than Digest or Basic.

Changed in version 3.3: Raise ValueError on an unsupported authentication scheme.

class urllib.request.ProxyDigestAuthHandler(password_mgr=None)

Handle authentication with the proxy. password_mgr, if given, should be something that is compatible with HTTPPasswordMgr; refer to the section on HTTPPasswordMgr objects for information on the interface that must be supported.

class urllib.request.HTTPHandler

Handler class for opening HTTP URLs.

class urllib.request.HTTPSHandler(debuglevel=0, context=None, check_hostname=None)

A class to handle opening of HTTPS URLs. context and check_hostname have the same meaning as in http.client.HTTPSConnection.

Changed in version 3.2: context and check_hostname were added.

class urllib.request.FileHandler

Open a local file.

class urllib.request.DataHandler

Open the data URL.

New in version 3.4.

class urllib.request.FTPHandler

Open the FTP URL.

class urllib.request.CacheFTPHandler

Opens an FTP URL and caches open FTP connections to minimize latency.

class urllib.request.UnknownHandler

A fallback class that handles all URLs of unknown type.

class urllib.request.HTTPErrorProcessor

Process HTTP error responses.

Request object

The following methods describe the public interface of Request, so all of them may be overridden in subclasses. Several public attributes are also defined here that clients can use to inspect the parsed request.

Request.full_url

The original URL passed to the constructor.

Changed in version 3.4.

Request.full_url is a property with a setter, a getter and a deleter. Getting full_url returns the original request URL with the fragment, if it was present.

Request.type

The URI scheme.

Request.host

The URI authority, typically a host, but it may also contain a port separated by a colon.

Request.origin_req_host

The original host of the request, without port.

Request.selector

URI path. If Request uses a proxy, the selector will be the full URL passed to the proxy.
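How a URL is split across these attributes can be seen with a short sketch; the URL is a made-up example and no connection is opened.

```python
from urllib.request import Request

url = "http://www.example.com:8080/docs/index.html?lang=en#intro"
req = Request(url)

print(req.full_url)         # the original URL, fragment included
print(req.type)             # the scheme
print(req.host)             # the authority, port included
print(req.origin_req_host)  # the host without the port
print(req.selector)         # path and query, without the fragment
```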

Request.data

Requested data body, if not given, it is None .

Changed in version 3.4: Changing the value of Request.data now deletes the "Content-Length" header if it was previously set or calculated.

Request.unverifiable

Boolean value, indicating whether this request falls into the unverifiable situation defined in RFC 2965 .

Request.method

The HTTP request method to use. By default its value is None, which means that get_method() will do its normal computation of the method to be used. Its value can be set (thus overriding the default computation in get_method()) either by providing a default value at the class level in a Request subclass, or by passing a value to the Request constructor via the method argument.

New in version 3.3.

Changed in version 3.4: A default value can now be set in subclasses; previously it could only be set via the constructor argument.

Request.get_method()

Return a string indicating the HTTP request method. If Request.method is not None, return its value; otherwise return 'GET' if Request.data is None, or 'POST' if it is not. This is only meaningful for HTTP requests.

Changed in version 3.3: get_method now respects the value of Request.method .

Request.add_header(key, val)

Add another header to the request. Headers are currently ignored by all handlers except HTTP handlers, where they are added to the list of headers sent to the server. Note that there cannot be more than one header with the same name, and later calls will overwrite previous calls in case the key collides. Currently, this is no loss of HTTP functionality, since all headers which have meaning when used more than once have a (header-specific) way of gaining the same functionality using only one header. Note that headers added using this method are also added to redirected requests.

Request.add_unredirected_header(key, header)

Add a header that will not be added to a redirected request.

Request.has_header(header)

Return whether the instance has the named header (checks both regular and unredirected headers).

Request.remove_header(header)

Remove the named header from the request instance (both from regular and unredirected headers).

New in version 3.4.

Request.get_full_url()

Returns the URL given in the constructor.

Changed in version 3.4.

Returns Request.full_url.

Request.set_proxy(host, type)

Prepare the request by connecting to a proxy server. The host and type will replace those of the instance, and the instance's selector will be the original URL given in the constructor.

Request.get_header(header_name, default=None)

Return the value of the given header. If the header is not present, return the default value.

Request.header_items()

Return a list of tuples (header_name, header_value) of the Request headers.
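The header methods above can be exercised without sending anything, as in the following sketch. The header names are invented for illustration. One detail here is an observation about the CPython implementation rather than something stated in this section: names are stored in capitalized form (for example "X-token"), and the lookup methods compare against that stored form.

```python
from urllib.request import Request

req = Request("http://www.example.com/")

req.add_header("X-Token", "first")
req.add_header("x-token", "second")         # same name: overwrites "first"
req.add_unredirected_header("X-Once", "1")  # not copied to redirected requests

# Lookups use the stored, capitalized spelling of the name.
token = req.get_header("X-token")
missing = req.get_header("X-missing", "n/a")  # falls back to the default
has_once = req.has_header("X-once")
items = sorted(req.header_items())

req.remove_header("X-once")
has_once_after = req.has_header("X-once")

print(token, missing, has_once, items, has_once_after)
```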

Changed in version 3.4: The following methods, deprecated since 3.3, have been removed: add_data, has_data, get_data, get_type, get_host, get_selector, get_origin_req_host, and is_unverifiable.

OpenerDirector object

The OpenerDirector instance has the following methods:

OpenerDirector.add_handler(handler)

handler should be an instance of BaseHandler. The following methods are searched for and added to the possible chains (note that HTTP errors are a special case). Note that, in the following, protocol should be replaced with the actual protocol to handle; for example, http_response() would be the HTTP protocol response handler. Also, type should be replaced with the actual HTTP code; for example, http_error_404() would handle HTTP 404 errors.

  • <protocol>_open(): signals that the handler knows how to open protocol URLs.

    For more information, see BaseHandler.<protocol>_open().

  • http_error_<type>(): signals that the handler knows how to handle HTTP errors with HTTP error code type.

    For more information, see BaseHandler.http_error_<nnn>().

  • <protocol>_error(): signals that the handler knows how to handle errors from a (non-http) protocol.

  • <protocol>_request(): signals that the handler knows how to pre-process protocol requests.

    For more information, see BaseHandler.<protocol>_request().

  • <protocol>_response(): signals that the handler knows how to post-process protocol responses.

    For more information, see BaseHandler.<protocol>_response().

OpenerDirector.open(url, data=None[, timeout])

Open the given url (which can be a request object or a string), optionally passing the given data. Arguments, return values and exceptions raised are the same as those of urlopen() (which simply calls the open() method on the currently installed global OpenerDirector). The optional timeout parameter specifies a timeout in seconds for blocking operations such as the connection attempt (if not specified, the global default timeout setting is used). The timeout feature actually works only for HTTP, HTTPS and FTP connections.

OpenerDirector.error(proto, *args)

Handle errors for the given protocol. The registered error handler for the given protocol will be called with the given arguments (protocol dependent). The HTTP protocol is a special case, and the HTTP response code is used to determine the specific error handler; please refer to the http_error_<type>() method of the handler class.

The return value and exception are the same as urlopen() .

The OpenerDirector object opens a URL in 3 stages:

The order in which these methods are called within each stage is determined by sorting the handler instances.

  1. Every handler with a method named like <protocol>_request() has that method called to pre-process the request.

  2. Handlers with a method named like <protocol>_open() are called to handle the request. This stage ends when a handler either returns a non-None value (that is, a response) or raises an exception (usually URLError). Exceptions are allowed to propagate.

    In fact, the above algorithm is first tried for methods named default_open(). If all such methods return None, the algorithm is repeated for methods named like <protocol>_open(). If all such methods return None, the algorithm is repeated for methods named unknown_open().

    Note that the implementation of these methods may involve calls of the parent OpenerDirector instance's open() and error() methods.

  3. Every handler with a method named like <protocol>_response() has that method called to post-process the response.
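The three stages can be observed with a small handler for a made-up scheme. Everything named "echo" below is hypothetical and exists only for this sketch: the handler records the order of the calls and returns the text after "echo:" as the response body, so no network access is involved.

```python
import io
from email.message import Message
from urllib.request import BaseHandler, OpenerDirector
from urllib.response import addinfourl

calls = []  # records the order in which the stages run


class EchoHandler(BaseHandler):
    """Hypothetical handler for an invented "echo:" scheme."""

    def echo_request(self, req):
        calls.append("request")   # stage 1: pre-process the request
        return req

    def echo_open(self, req):
        calls.append("open")      # stage 2: produce a response
        body = io.BytesIO(req.selector.encode("utf-8"))
        return addinfourl(body, Message(), req.full_url)

    def echo_response(self, req, response):
        calls.append("response")  # stage 3: post-process the response
        return response


opener = OpenerDirector()
opener.add_handler(EchoHandler())

with opener.open("echo:hello") as response:
    body = response.read()

print(calls, body)
```

A bare OpenerDirector is used instead of build_opener() so that only the handler under discussion takes part.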

BaseHandler object

The BaseHandler object provides some directly available methods, as well as other methods that can be used by derived classes. Here are the methods available for immediate use:

BaseHandler.add_parent(director)

Add director as parent OpenerDirector.

BaseHandler.close()

Remove any parents.

The following properties and methods are only available to subclasses of BaseHandler :

Remark

The convention has been adopted that subclasses defining <protocol>_request() or <protocol>_response() methods are named *Processor; all others are named *Handler.

BaseHandler.parent

An available OpenerDirector that can be used to open URIs with other protocols, or handle errors.

BaseHandler.default_open(req)

This method is not defined in BaseHandler, but subclasses should define it if they want to catch all URLs.

This method, if implemented, will be called by the parent OpenerDirector. It should return a file-like object as described in the return value of the open() method of OpenerDirector, or None. It should raise URLError, unless a truly exceptional thing happens (for example, MemoryError should not be mapped to URLError).

This method will be called before the open method of all protocols.

BaseHandler.<protocol>_open(req)

This method is not defined in BaseHandler, but subclasses should define it if they want to handle URLs with the given protocol.

This method, if defined, will be called by the parent OpenerDirector. Return values should be the same as for default_open().

BaseHandler.unknown_open(req)

This method is not defined in BaseHandler, but subclasses should define it if they want to catch all URLs with no specific registered handler to open them.

This method, if implemented, will be called by the parent OpenerDirector. Return values should be the same as for default_open().

BaseHandler.http_error_default(req, fp, code, msg, hdrs)

This method is not defined in BaseHandler, but subclasses should override it if they intend to provide a catch-all for otherwise unhandled HTTP errors. It will be called automatically by the OpenerDirector getting the error, and should not normally be called in other circumstances.

req will be a Request object, fp will be a file-like object with the HTTP error body, code will be the three-digit code of the error, msg will be the user-visible explanation of the code and hdrs will be a mapping object with the headers of the error.

Return values and exceptions raised should be the same as those of urlopen().

BaseHandler.http_error_<nnn>(req, fp, code, msg, hdrs)

nnn should be a three-digit HTTP error code. This method is also not defined in BaseHandler, but will be called, if it exists, on an instance of a subclass when an HTTP error with code nnn occurs.

Subclasses should override this method to handle specific HTTP errors.

Arguments, return values and exceptions raised should be the same as for http_error_default().
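A sketch of a code-specific error handler follows. The class name, URL and fallback body are assumptions made for illustration, and the error is simulated by calling OpenerDirector.error() directly instead of contacting a server.

```python
import io
from email.message import Message
from urllib.request import BaseHandler, OpenerDirector, Request
from urllib.response import addinfourl


class NotFoundFallback(BaseHandler):
    """Hypothetical handler that replaces 404 errors with a stand-in body."""

    def http_error_404(self, req, fp, code, msg, hdrs):
        # Returning a response object (instead of raising) ends the error chain.
        return addinfourl(io.BytesIO(b"fallback"), hdrs, req.full_url, code)


opener = OpenerDirector()
opener.add_handler(NotFoundFallback())

req = Request("http://www.example.com/missing")

# Simulate the arguments an HTTP handler would pass for a 404 response.
result = opener.error("http", req, io.BytesIO(b""), 404, "Not Found", Message())
fallback_body = result.read()
fallback_status = result.status

print(fallback_status, fallback_body)
```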

BaseHandler.<protocol>_request(req)

This method is not defined in BaseHandler, but subclasses should define it if they want to pre-process requests of the given protocol.

This method, if defined, will be called by the parent OpenerDirector. req will be a Request object. The return value should be a Request object.

BaseHandler.<protocol>_response(req, response)

This method is not defined in BaseHandler, but subclasses should define it if they want to post-process responses of the given protocol.

This method, if defined, will be called by the parent OpenerDirector. req will be a Request object. response will be an object implementing the same interface as the return value of urlopen(). The return value should implement the same interface as the return value of urlopen().

HTTPRedirectHandler object

Remark

Some HTTP redirections require action from this module's client code. If this is the case, HTTPError is raised. See RFC 2616 for details of the precise meanings of the various redirection codes.

An HTTPError exception is raised as a security consideration if the HTTPRedirectHandler is presented with a redirected URL which is not an HTTP, HTTPS or FTP URL.

HTTPRedirectHandler.redirect_request(req, fp, code, msg, hdrs, newurl)

Return a Request or None in response to a redirect. This is called by the default implementations of the http_error_30*() methods when a redirection is received from the server. If a redirection should take place, return a new Request to allow http_error_30*() to perform the redirect to newurl. Otherwise, raise HTTPError if no other handler should try to handle this URL, or return None if you can't but another handler might.

Remark

The default implementation of this method does not strictly follow RFC 2616, which says that 301 and 302 responses to POST requests must not be automatically redirected without confirmation by the user. In reality, browsers do allow automatic redirection of these responses, changing the POST to a GET, and the default implementation reproduces this behavior.
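The behavior of the default redirect_request() can be checked in isolation by calling it with hand-made arguments, as in this sketch. The URLs are placeholders and the empty body and header objects stand in for what a real response would supply.

```python
import io
from email.message import Message
from urllib.error import HTTPError
from urllib.request import HTTPRedirectHandler, Request

handler = HTTPRedirectHandler()
post_req = Request("http://www.example.com/form", data=b"a=1")
target = "http://www.example.com/done"

# A 302 response to a POST is followed, with the method changed to GET.
new_req = handler.redirect_request(
    post_req, io.BytesIO(b""), 302, "Found", Message(), target
)

# A 307 response to a POST is not redirected automatically.
try:
    handler.redirect_request(
        post_req, io.BytesIO(b""), 307, "Temporary Redirect", Message(), target
    )
    refused = False
except HTTPError as exc:
    refused = True
    exc.close()

print(new_req.get_method(), new_req.full_url, refused)
```

A subclass that wants a different policy would override redirect_request() and return a Request, return None, or raise HTTPError, as described above.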

HTTPRedirectHandler.http_error_301(req, fp, code, msg, hdrs)

Redirect to the Location: or URI: URL. This method is called by the parent OpenerDirector when getting an HTTP 'moved permanently' response.

HTTPRedirectHandler.http_error_302(req, fp, code, msg, hdrs)

The same as http_error_301(), but called for the 'found' response.

HTTPRedirectHandler.http_error_303(req, fp, code, msg, hdrs)

The same as http_error_301(), but called for the 'see other' response.

HTTPRedirectHandler.http_error_307(req, fp, code, msg, hdrs)

The same as http_error_301(), but called for the 'temporary redirect' response. It does not allow changing the request method from POST to GET.

HTTPRedirectHandler.http_error_308(req, fp, code, msg, hdrs)

The same as http_error_301(), but called for the 'permanent redirect' response. It does not allow changing the request method from POST to GET.

New in version 3.11.

HTTPCookieProcessor object

Instances of HTTPCookieProcessor have one attribute:

HTTPCookieProcessor.cookiejar

The http.cookiejar.CookieJar in which cookies are stored.

ProxyHandler object

ProxyHandler.<protocol>_open(request)

The ProxyHandler will have a method <protocol>_open() for every protocol which has a proxy in the proxies dictionary given in the constructor. The method will modify requests to go through the proxy, by calling request.set_proxy(), and call the next handler in the chain to actually execute the protocol.
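A brief sketch of both halves of that description follows. The proxy address is a placeholder; the second part shows in isolation what request.set_proxy() does to a request, without opening any connection.

```python
from urllib.request import ProxyHandler, Request

# Hypothetical proxy; only the "http" scheme is given one.
handler = ProxyHandler({"http": "http://proxy.example:3128"})

has_http_open = hasattr(handler, "http_open")  # created for the "http" entry
has_ftp_open = hasattr(handler, "ftp_open")    # no entry, so no method

# The effect of set_proxy() on a request.
req = Request("http://www.example.com/page")
req.set_proxy("proxy.example:3128", "http")

print(has_http_open, has_ftp_open)
print(req.host)      # now the proxy
print(req.selector)  # now the full original URL
```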

HTTPPasswordMgr object

These methods are available on HTTPPasswordMgr and HTTPPasswordMgrWithDefaultRealm objects.

HTTPPasswordMgr.add_password(realm, uri, user, passwd)

uri can be either a single URI or a sequence of URIs. realm, user and passwd must be strings. This causes (user, passwd) to be used as authentication tokens when authentication for realm and a super-URI of any of the given URIs is given.

HTTPPasswordMgr.find_user_password(realm, authuri)

Get the user and password for the given realm and URI, if any. This method will return (None, None) if there is no matching user and password.

For HTTPPasswordMgrWithDefaultRealm objects, the realm None will be searched if the given realm has no matching user and password.
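The difference between the two managers can be sketched as follows. Realm names, URLs and credentials are invented for illustration.

```python
from urllib.request import HTTPPasswordMgr, HTTPPasswordMgrWithDefaultRealm

# Plain manager: the realm must match exactly.
strict = HTTPPasswordMgr()
strict.add_password("Staff Area", "http://www.example.com/staff/", "alice", "s3cret")

match = strict.find_user_password("Staff Area", "http://www.example.com/staff/reports")
wrong_realm = strict.find_user_password("Other Realm", "http://www.example.com/staff/reports")

# Default-realm manager: entries stored under None act as a catch-all.
fallback = HTTPPasswordMgrWithDefaultRealm()
fallback.add_password(None, "http://www.example.com/", "bob", "hunter2")
any_realm = fallback.find_user_password("Other Realm", "http://www.example.com/staff/reports")

print(match, wrong_realm, any_realm)
```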

HTTPPasswordMgrWithPriorAuth object

This password manager extends HTTPPasswordMgrWithDefaultRealm to keep track of URIs for which authentication credentials should always be sent.

HTTPPasswordMgrWithPriorAuth.add_password(realm, uri, user, passwd, is_authenticated=False)

realm, uri, user and passwd are as for HTTPPasswordMgr.add_password(). is_authenticated sets the initial value of the is_authenticated flag for the given URI or list of URIs. If is_authenticated is specified as True, realm is ignored.

HTTPPasswordMgrWithPriorAuth.find_user_password(realm, authuri)

Same as for HTTPPasswordMgrWithDefaultRealm objects.

HTTPPasswordMgrWithPriorAuth.update_authenticated(self, uri, is_authenticated=False)

Update the is_authenticated flag for the given uri or list of URIs.

HTTPPasswordMgrWithPriorAuth.is_authenticated(self, authuri)

Returns the current state of the is_authenticated flag for the given URI.
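The flag handling can be tried out without any network access. The sketch below sets the flag when the credentials are registered, queries it for a URI below the registered one, and then clears it again. The host and credentials are hypothetical.

```python
import urllib.request

# Hypothetical host and credentials, for illustration only.
mgr = urllib.request.HTTPPasswordMgrWithPriorAuth()
mgr.add_password(None, "https://example.com/", "alice", "s3cret",
                 is_authenticated=True)

# The flag applies to the registered URI and to URIs below it.
before = mgr.is_authenticated("https://example.com/reports/today")

# Clear the flag again for the same URI.
mgr.update_authenticated("https://example.com/", False)
after = mgr.is_authenticated("https://example.com/reports/today")

print(before, after)
```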

AbstractBasicAuthHandler object

AbstractBasicAuthHandler.http_error_auth_reqed(authreq, host, req, headers)

Handle an authentication request by obtaining a user/password pair and retrying the request. authreq should be the name of the header in which the realm information is included in the request, host specifies the URL and path to authenticate for, req should be the (failed) Request object, and headers should be the error headers.

host is either an authority (such as "python.org") or a URL containing an authority component (such as "http://python.org/"). In either case, the authority must not contain a userinfo component (so "python.org" and "python.org:80" are fine, but "joe:password@python.org" is not).

HTTPBasicAuthHandler object

HTTPBasicAuthHandler.http_error_401(req, fp, code, msg, hdrs)

If available, retry the request with the authentication information.

ProxyBasicAuthHandler object

ProxyBasicAuthHandler.http_error_407(req, fp, code, msg, hdrs)

If available, retry the request with the authentication information.

AbstractDigestAuthHandler object

AbstractDigestAuthHandler.http_error_auth_reqed(authreq, host, req, headers)

authreq should be the name of the header in which the realm information is included in the request, host should be the host to authenticate to, req should be the (failed) Request object, and headers should be the error headers.

HTTPDigestAuthHandler object

HTTPDigestAuthHandler.http_error_401(req, fp, code, msg, hdrs)

If available, retry the request with the authentication information.

ProxyDigestAuthHandler object

ProxyDigestAuthHandler.http_error_407(req, fp, code, msg, hdrs)

If available, retry the request with the authentication information.

HTTPHandler object

HTTPHandler.http_open(req)

Send an HTTP request, which can be either GET or POST, depending on req.has_data().
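The choice between GET and POST can be inspected on the Request object before anything is sent. The sketch below uses Request.get_method(), which follows the same rule (POST when data is present, GET otherwise) unless a method is given explicitly. The URL is a placeholder and no network access takes place.

```python
import urllib.request

url = "http://www.example.com/"  # placeholder; nothing is actually sent

without_data = urllib.request.Request(url)
with_data = urllib.request.Request(url, data=b"spam=1")
explicit = urllib.request.Request(url, data=b"spam=1", method="PUT")

# Collect the HTTP method each request would use.
methods = [r.get_method() for r in (without_data, with_data, explicit)]
print(methods)
```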

HTTPSHandler object

HTTPSHandler.https_open(req)

Send an HTTPS request, which can be either GET or POST, depending on req.has_data().

FileHandler object

FileHandler.file_open(req)

Open the file locally if there is no host name, or if the host name is 'localhost'.

Changed in version 3.2: This method is applicable only to local hostnames. When a remote hostname is given, a URLError is raised.
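A short sketch of opening a local file through a file: URL follows. The temporary file and its contents are created only for this illustration.

```python
import pathlib
import tempfile
import urllib.request

with tempfile.TemporaryDirectory() as tmp:
    path = pathlib.Path(tmp) / "hello.txt"   # throwaway file for the demo
    path.write_bytes(b"hello from a local file\n")

    # as_uri() yields a file:// URL without a host part, so the file is
    # opened locally.
    with urllib.request.urlopen(path.as_uri()) as f:
        content = f.read()
        length = f.headers["Content-Length"]

print(content, length)
```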

DataHandler object

DataHandler.data_open(req)

Read a data URL. This kind of URL contains the content encoded in the URL itself. The data URL syntax is specified in RFC 2397. This implementation ignores whitespace in base64-encoded data URLs, so the URL may be wrapped in whatever source file it comes from. However, although some browsers do not mind missing padding at the end of a base64-encoded data URL, this implementation raises a ValueError in that case.
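The following sketch opens a base64-encoded data URL and then shows the padding check failing once the trailing "=" is stripped. The payload is arbitrary.

```python
import base64
import urllib.request

payload = b"hello"  # arbitrary example payload
encoded = base64.b64encode(payload).decode("ascii")  # ends with "=" padding
url = "data:text/plain;base64," + encoded

with urllib.request.urlopen(url) as f:
    body = f.read()
    content_type = f.headers.get_content_type()

# Dropping the padding makes the base64 tail invalid for this module.
try:
    urllib.request.urlopen(url.rstrip("="))
    padding_error = False
except ValueError:
    padding_error = True

print(body, content_type, padding_error)
```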

FTPHandler object

FTPHandler.ftp_open(req)

Open the FTP file indicated by req. The login is always done with an empty username and password.

CacheFTPHandler object

CacheFTPHandler objects are FTPHandler objects with the following additional methods:

CacheFTPHandler.setTimeout(t)

Set the connection timeout to t seconds.

CacheFTPHandler.setMaxConns(m)

Set the maximum number of cached connections to m .

UnknownHandler object

UnknownHandler.unknown_open()

Raise a URLError exception.

HTTPErrorProcessor object

HTTPErrorProcessor.http_response(request, response)

Process HTTP error responses.

For 200 error codes, the response object is returned immediately.

For non-200 error codes, this simply passes the job on to the http_error_<type>() handler methods, via OpenerDirector.error(). Eventually, HTTPDefaultErrorHandler will raise an HTTPError if no other handler handles the error.

HTTPErrorProcessor.https_response(request, response)

Process HTTPS error responses.

The behavior is the same as http_response().

Examples

In addition to the examples below, more examples are given in How to use the urllib package to obtain network resources.

The following example reads the python.org homepage and displays the first 300 bytes of its content:

>>> import urllib.request
>>> with urllib.request.urlopen('http://www.python.org/') as f:
...     print(f.read(300))
...
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n\n<head>\n
<meta http-equiv="content-type" content="text/html; charset=utf-8" />\n
<title>Python Programming '

Note that urlopen returns a bytes object. This is because urlopen has no way to automatically determine the encoding of the byte stream it receives from the HTTP server. In general, a program should decode the returned bytes object to a string once it determines or guesses the appropriate encoding.

The W3C document Character encodings lists the various ways in which an (X)HTML or XML document can specify its encoding information.

Because the python.org website declares utf-8 encoding in its meta tag, the same encoding is used here to decode the bytes object.

>>> with urllib.request.urlopen('http://www.python.org/') as f:
...     print(f.read(100).decode('utf-8'))
...
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtm

The same result can also be achieved without using the context manager approach:

>>> import urllib.request
>>> f = urllib.request.urlopen('http://www.python.org/')
>>> print(f.read(100).decode('utf-8'))
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtm
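When the encoding is not known in advance, one option is to look for a charset parameter in the Content-Type response header and fall back to a guess. The sketch below uses a data: URL so that it runs without network access; with an HTTP response the same headers attribute would be consulted. The utf-8 fallback is an assumption made for this example, not something urlopen provides.

```python
import urllib.request

# A data: URL stands in for a real server so the example is self-contained.
url = "data:text/plain;charset=utf-8,caf%C3%A9"

with urllib.request.urlopen(url) as f:
    raw = f.read()
    # get_content_charset() returns the charset parameter of the
    # Content-Type header, or None if the header does not declare one.
    charset = f.headers.get_content_charset() or "utf-8"

text = raw.decode(charset)
print(text)
```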

The following example sends a data stream to the stdin of a CGI and reads the data it returns. Note that this example only works when the Python installation supports SSL.

>>> import urllib.request
>>> req = urllib.request.Request(url='https://localhost/cgi-bin/test.cgi',
...                       data=b'This data is passed to stdin of the CGI')
>>> with urllib.request.urlopen(req) as f:
...     print(f.read().decode('utf-8'))
...
Got Data: "This data is passed to stdin of the CGI"

The CGI code in the above example looks like this:

#!/usr/bin/env python
import sys
data = sys.stdin.read()
print('Content-type: text/plain\n\nGot Data: "%s"' % data)

The following is an example of sending a PUT request using Request:

import urllib.request
DATA = b'some data'
req = urllib.request.Request(url='http://localhost:8080', data=DATA, method='PUT')
with urllib.request.urlopen(req) as f:
    pass
print(f.status)
print(f.reason)

Basic HTTP authentication example:

import urllib.request
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='hug',
                          passwd='kadidd!ehopper')
opener = urllib.request.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib.request.install_opener(opener)
urllib.request.urlopen('http://www.example.com/login.html')

build_opener() provides many ready-made handlers by default, including ProxyHandler. By default, ProxyHandler will use an environment variable named <scheme>_proxy where <scheme> is the relevant URL protocol. For example, the http_proxy environment variable can be read to obtain the URL of the HTTP proxy.

This example replaces the default ProxyHandler with one that uses programmatically supplied proxy URLs, and adds proxy authentication support with ProxyBasicAuthHandler.

proxy_handler = urllib.request.ProxyHandler({'http': 'http://www.example.com:3128/'})
proxy_auth_handler = urllib.request.ProxyBasicAuthHandler()
proxy_auth_handler.add_password('realm', 'host', 'username', 'password')

opener = urllib.request.build_opener(proxy_handler, proxy_auth_handler)
# This time, rather than install the OpenerDirector, we use it directly:
opener.open('http://www.example.com/login.html')

Adding HTTP headers:

Use the headers argument to the Request constructor, or:

import urllib.request
req = urllib.request.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
# Customize the default User-Agent header value:
req.add_header('User-Agent', 'urllib-example/0.1 (Contact: . . .)')
r = urllib.request.urlopen(req)
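For comparison, a sketch of the constructor form is shown here. The header values are illustrative, and the request is only built and inspected, not sent.

```python
import urllib.request

req = urllib.request.Request(
    'http://www.example.com/',
    headers={
        'Referer': 'http://www.python.org/',
        'User-Agent': 'urllib-example/0.1',  # illustrative value
    },
)

# Header names are stored capitalized ('User-agent'), so lookups on the
# Request object should use that spelling.
print(req.get_header('User-agent'))
print(req.header_items())
```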

OpenerDirector automatically adds a User-Agent header to every Request. To change this:

import urllib.request
opener = urllib.request.build_opener()
opener.addheaders = [('User-agent', 'Mozilla/5.0')]
opener.open('http://www.example.com/')

Also, remember that a few standard headers (Content-Length, Content-Type and Host) are added when the Request is passed to urlopen() (or OpenerDirector.open()).

Here is an example session that uses the GET method to retrieve a URL containing parameters:

>>> import urllib.request
>>> import urllib.parse
>>> params = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> url = "http://www.musi-cal.com/cgi-bin/query?%s" % params
>>> with urllib.request.urlopen(url) as f:
...     print(f.read().decode('utf-8'))
...

The following example uses the POST method instead. Note that the output of urlencode is encoded to bytes before it is passed to urlopen as data:

>>> import urllib.request
>>> import urllib.parse
>>> data = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> data = data.encode('ascii')
>>> with urllib.request.urlopen("http://requestb.in/xrbl82xr", data) as f:
...     print(f.read().decode('utf-8'))
...
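To make the difference between the two sessions concrete, the sketch below shows what urlencode produces for the same dictionary and how the string is turned into bytes for a POST body. Nothing is sent over the network.

```python
import urllib.parse

params = {'spam': 1, 'eggs': 2, 'bacon': 0}

query = urllib.parse.urlencode(params)   # a str, suitable for a GET URL
body = query.encode('ascii')             # bytes, suitable as POST data

print(query)   # spam=1&eggs=2&bacon=0
print(body)    # b'spam=1&eggs=2&bacon=0'
```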

The following example explicitly specifies an HTTP proxy, overriding the setting in environment variables:

>>> import urllib.request
>>> proxies = {'http': 'http://proxy.example.com:8080/'}
>>> opener = urllib.request.FancyURLopener(proxies)
>>> with opener.open("http://www.python.org") as f:
...     f.read().decode('utf-8')
...

The following example uses no proxies at all, overriding the settings in environment variables:

>>> import urllib.request
>>> opener = urllib.request.FancyURLopener({})
>>> with opener.open("http://www.python.org/") as f:
...     f.read().decode('utf-8')
...
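Because FancyURLopener belongs to the legacy interface, the same two setups can also be sketched with ProxyHandler and build_opener(). The proxy URL is a placeholder, and the open() call is left commented out so that the snippet runs without network access.

```python
import urllib.request

# Explicit HTTP proxy (placeholder URL), overriding environment settings.
proxy_handler = urllib.request.ProxyHandler(
    {'http': 'http://proxy.example.com:8080/'})
proxy_opener = urllib.request.build_opener(proxy_handler)

# An empty dictionary disables proxies, again ignoring the environment.
no_proxy_handler = urllib.request.ProxyHandler({})
direct_opener = urllib.request.build_opener(no_proxy_handler)

# A <protocol>_open() method exists only for schemes that have a proxy.
print(hasattr(proxy_handler, 'http_open'))      # True
print(hasattr(no_proxy_handler, 'http_open'))   # False

# with direct_opener.open('http://www.python.org/') as f:
#     print(f.read(100).decode('utf-8'))
```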

Legacy interface

The following functions and classes are ported from the Python 2 module urllib (as opposed to urllib2). They might become deprecated at some point in the future.

urllib.request.urlretrieve(url, filename=None, reporthook=None, data=None)

Copy a network object denoted by a URL to a local file. If the URL points to a local file, the object is not copied unless filename is supplied. The return value is a tuple (filename, headers), where filename is the local file name under which the object can be found and headers is whatever the info() method of the object returned by urlopen() returned (for a remote object). Exceptions are the same as for urlopen().

The second argument, if present, specifies the file location to copy to (if absent, the location is a temporary file with a generated name). The third argument, if present, is a callable that is called once when the network connection is established and once after each block is read thereafter. The callable is passed three arguments: the number of blocks transferred so far, the block size in bytes, and the total size of the file. The total size may be -1 on older FTP servers that do not return a file size in response to a retrieval request.

The following example illustrates the most common usage scenario:

>>> import urllib.request
>>> local_filename, headers = urllib.request.urlretrieve('http://python.org/')
>>> html = open(local_filename)
>>> html.close()
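The reporthook callable can be exercised without a network connection by retrieving a file: URL with an explicit target filename, which forces a real copy. This is only a sketch: the file names and the 20000-byte payload are arbitrary, and the number of callbacks depends on the block size the implementation happens to use.

```python
import pathlib
import tempfile
import urllib.request

calls = []

def report(block_count, block_size, total_size):
    # Record every progress callback made by urlretrieve().
    calls.append((block_count, block_size, total_size))

with tempfile.TemporaryDirectory() as tmp:
    source = pathlib.Path(tmp) / "source.bin"
    target = pathlib.Path(tmp) / "copy.bin"
    source.write_bytes(b"x" * 20000)

    # A filename is supplied, so the local file is actually copied.
    filename, headers = urllib.request.urlretrieve(
        source.as_uri(), filename=str(target), reporthook=report)
    copied = target.read_bytes()

print(len(copied), calls[0], calls[-1])
```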

If the url uses the http: scheme identifier, the optional data argument may be given to specify a POST request (normally the request type is GET). The data argument must be a bytes object in standard application/x-www-form-urlencoded format; see the urllib.parse.urlencode() function.

urlretrieve() raises ContentTooShortError when it detects that the amount of data available is less than the expected amount (the size reported by the Content-Length header). This can happen, for example, when a download is interrupted.

The Content-Length is treated as a lower bound: if more data is available, urlretrieve reads it, but if less data is available, it raises the exception.

In this case the downloaded data can still be retrieved; it is stored in the content attribute of the exception instance.

If the Content-Length header is not provided, urlretrieve cannot check the size of the data it downloads and simply returns it. In this case you can only assume that the download was successful.

urllib.request.urlcleanup()

Clean up temporary files that may have been left behind by previous calls to urlretrieve().

class urllib.request.URLopener(proxies=None, **x509)

Deprecated since version 3.3.

Base class for opening and reading URLs. Unless you need to support opening objects using schemes other than http:, ftp: or file:, you probably want to use FancyURLopener.

By default, the URLopener class sends a User-Agent header of urllib/VVV, where VVV is the urllib version number. Applications can define their own User-Agent header by subclassing URLopener or FancyURLopener and setting the class attribute version to an appropriate string value in the subclass definition.

The optional proxies parameter should be a dictionary mapping scheme names to proxy URLs; an empty dictionary turns proxies off completely. Its default value is None, in which case environmental proxy settings are used if present, as discussed in the definition of urlopen() above.

Additional keyword parameters, collected in x509, may be used for authentication of the client when using the https: scheme. The keywords key_file and cert_file are supported to provide an SSL key and certificate; both are needed to support client authentication.

URLopener objects raise an OSError exception if the server returns an error code.

open(fullurl, data=None)

Open fullurl using the appropriate protocol. This method sets up cache and proxy information, then calls the appropriate open method with its input arguments. If the scheme is not recognized, open_unknown() is called. The data argument has the same meaning as the data argument of urlopen().

This method always quotes fullurl using quote().

open_unknown(fullurl, data=None)

Overridable interface to open unknown URL types.

retrieve(url, filename=None, reporthook=None, data=None)

Retrieve the contents of url and place them in filename. The return value is a tuple consisting of a local filename and either an email.message.Message object containing the response headers (for remote URLs) or None (for local URLs). The caller must then open and read the contents of filename. If filename is not given and the URL refers to a local file, the input filename is returned. If the URL is non-local and filename is not given, the filename is the output of tempfile.mktemp(). If reporthook is given, it must be a function accepting three numeric parameters: a chunk number, the maximum size chunks are read in, and the total size of the download (-1 if unknown). It is called once at the start and after each chunk of data is read from the network. reporthook is ignored for local URLs.

If the url uses the http: scheme identifier, the optional data argument may be given to specify a POST request (normally the request type is GET). The data argument must be in standard application/x-www-form-urlencoded format; see the urllib.parse.urlencode() function.

version

Variable that specifies the user agent of the opener object. To get urllib to tell servers that it is a particular user agent, set this in a subclass as a class variable, or in the constructor before calling the base constructor.

class urllib.request.FancyURLopener(...)

Deprecated since version 3.3.

FancyURLopener subclasses URLopener, providing default handling for the following HTTP response codes: 301, 302, 303, 307 and 401. For the 30x response codes listed above, the Location header is used to fetch the actual URL. For 401 response codes (authentication required), basic HTTP authentication is performed. For the 30x response codes, recursion is bounded by the value of the maxtries attribute, which defaults to 10.

For all other response codes, the http_error_default() method is called, which you can override in subclasses to handle the error appropriately.

Note

According to the letter of RFC 2616, 301 and 302 responses to POST requests must not be automatically redirected without confirmation by the user. In reality, browsers do allow automatic redirection of these responses, changing the POST to a GET, and urllib reproduces this behavior.

The parameters to the constructor are the same as those for URLopener.

Note

When performing basic authentication, a FancyURLopener instance calls its prompt_user_passwd() method. The default implementation asks the user for the required information on the controlling terminal. A subclass may override this method to support more appropriate behavior if needed.

The FancyURLopener class offers one additional method that should be overridden to provide the appropriate behavior:

prompt_user_passwd(host, realm)

Return the information needed to authenticate the user at the given host in the specified security realm. The return value should be a tuple (user, password), which can be used for basic authentication.

The implementation prompts for this information on the terminal; an application should override this method to use an appropriate interaction model in the local environment.

urllib.request Restrictions

  • Currently, only the following protocols are supported: HTTP (versions 0.9 and 1.0), FTP, local files, and data URLs.

    Changed in version 3.4: Added support for data URLs.

  • The caching feature of urlretrieve() has been disabled until someone finds the time to implement proper processing of expiration time headers.

  • There should be a function to query whether a specific URL is in the cache.

  • To maintain backward compatibility, if a URL appears to point to a local file but the file cannot be opened, the URL is reinterpreted using the FTP protocol. This can sometimes lead to confusing error messages.

  • The urlopen() and urlretrieve() functions can cause arbitrarily long delays while waiting for a network connection to be set up. This means that it is difficult to build an interactive web client using these functions without using threads.

  • The data returned by urlopen() or urlretrieve() is the raw data returned by the server. This may be binary data (such as an image), plain text or, for example, HTML. The HTTP protocol provides type information in the response headers, which can be inspected by looking at the Content-Type header. If the returned data is HTML, you can use the html.parser module to parse it.

  • The code handling the FTP protocol cannot differentiate between a file and a directory. This can lead to unexpected behavior when attempting to read a URL that points to a file that is not accessible. If the URL ends in a /, it is assumed to refer to a directory and is handled accordingly. But if an attempt to read a file leads to a 550 error (meaning the URL cannot be found or is not accessible, often for permission reasons), the path is treated as a directory in order to handle the case where a directory is specified by a URL but the trailing / has been left off. This can cause misleading results when you try to fetch a file whose read permissions make it inaccessible; the FTP code tries to read it, fails with a 550 error, and then performs a directory listing for the unreadable file. If fine-grained control is needed, consider using the ftplib module, subclassing FancyURLopener, or changing _urlopener to meet your needs.

urllib.response --- Response class used by urllib

The urllib.response module defines functions and classes that provide a minimal file-like interface, including read() and readline(). Functions defined by this module are used internally by the urllib.request module. The typical response object is a urllib.response.addinfourl instance:

class urllib.response.addinfourl

url

The URL of the resource that was read, typically used to determine whether a redirect occurred.

headers

Returns the response headers as an instance of EmailMessage .

status

New in version 3.9.

Status code returned by the server.

geturl()

Deprecated since version 3.9: Use url instead.

info()

Deprecated since version 3.9: Use headers instead.

code

Deprecated since version 3.9: Use status instead.

getcode()

Deprecated since version 3.9: Use status instead.
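The attributes above can be illustrated by building an addinfourl instance by hand. This is only a sketch: urlopen() normally creates such objects, the URL, status code and header value are made up, and a plain email.message.Message is used for the headers. The status attribute requires Python 3.9 or later.

```python
import email.message
import io
import urllib.response

# Build a response by hand purely for illustration.
headers = email.message.Message()
headers["Content-Type"] = "text/plain; charset=utf-8"

resp = urllib.response.addinfourl(
    io.BytesIO(b"first line\nsecond line\n"),
    headers,
    "http://www.example.com/",
    code=200,
)

with resp:
    print(resp.url)                          # http://www.example.com/
    print(resp.status)                       # 200
    print(resp.headers.get_content_type())   # text/plain
    first = resp.readline()
```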

Origin blog.csdn.net/TalorSwfit20111208/article/details/135025153