Web crawler study notes, day 2 (the urllib library)

1. urllib library (Python's built-in HTTP request library). It has four modules: request (simulates sending requests); error (exception handling); parse (utility module for processing URLs); robotparser (parses a site's robots.txt file).

  1.1 request module. Official manual: https://docs.python.org/3/library/urllib.request.html

  urlopen() method: fetches a URL and, for HTTP(S) URLs, returns an HTTPResponse object. That object has the methods read(), readinto(), getheader(name), getheaders(), and fileno(), and the attributes msg, version, status, reason, debuglevel, and closed. The urlopen API: urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None). Note that cafile, capath, and cadefault were deprecated in Python 3.6 and removed in 3.13; use context instead.

  Request class: builds a request object that can be passed to urlopen(). class urllib.request.Request(url, data=None, headers={}, origin_req_host=None, unverifiable=False, method=None).

  Handler classes: various processors; some deal with login authentication, some handle cookies, some handle proxy settings. The parent class is BaseHandler, which defines the interface for methods such as default_open() and protocol_request(). Subclasses of BaseHandler include HTTPDefaultErrorHandler (handles HTTP error responses), HTTPRedirectHandler (handles redirects), HTTPCookieProcessor (handles cookies), ProxyHandler (sets a proxy), and HTTPBasicAuthHandler (handles basic authentication). HTTPPasswordMgrWithDefaultRealm manages passwords; it is a password manager used together with the authentication handlers, not a handler itself.

  OpenerDirector class: has an open() method. An opener is constructed from handlers with build_opener().
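  A minimal sketch of how Request, a handler, and an opener fit together. The URL https://example.com/, the User-Agent string, and the helper names build_request, build_cookie_opener, and fetch are assumptions made for illustration; they are not part of urllib.

```python
import urllib.parse
import urllib.request
from http.cookiejar import CookieJar


def build_request(url, form=None, user_agent="study-notes/0.1"):
    """Build a Request; sends POST when form data is given, GET otherwise."""
    data = urllib.parse.urlencode(form).encode("utf-8") if form else None
    return urllib.request.Request(
        url,
        data=data,
        headers={"User-Agent": user_agent},
        method="POST" if data else "GET",
    )


def build_cookie_opener():
    """Build an opener whose HTTPCookieProcessor stores cookies in a CookieJar."""
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))
    return opener, jar


def fetch(opener, request, timeout=10):
    """Send the request through the opener (needs network access)."""
    with opener.open(request, timeout=timeout) as resp:
        return resp.status, resp.read()


req = build_request("https://example.com/", form={"q": "1"})
opener, jar = build_cookie_opener()
print(req.get_method(), req.data)
```

  Calling fetch(opener, req) would actually send the request; after it returns, any cookies set by the server can be read by iterating over jar.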

  1.2 error module: the URLError class is the base class of this module; exceptions raised by the request module can be handled by catching it, and its reason attribute gives the cause of the error.

  HTTPError is a subclass of URLError, designed for HTTP request errors. It has three attributes: code (the HTTP status code), reason (the cause of the error), and headers (the response headers).
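  A minimal sketch of catching both exception types. Because HTTPError is a subclass of URLError, it has to be caught first. The helper names describe_error and safe_fetch are hypothetical, and the errors at the bottom are constructed by hand so the example runs without network access.

```python
import urllib.error
import urllib.request


def describe_error(exc):
    """Turn a urllib exception into a short message."""
    if isinstance(exc, urllib.error.HTTPError):
        # HTTPError carries code, reason and headers.
        return f"HTTP {exc.code}: {exc.reason}"
    if isinstance(exc, urllib.error.URLError):
        return f"URL error: {exc.reason}"
    raise TypeError("not a urllib error")


def safe_fetch(url, timeout=10):
    """Return (body, None) on success or (None, message) on failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read(), None
    except urllib.error.HTTPError as exc:  # subclass first
        return None, describe_error(exc)
    except urllib.error.URLError as exc:   # then the base class
        return None, describe_error(exc)


# Hand-built errors, only to show the attributes without a real request.
not_found = urllib.error.HTTPError("https://example.com/x", 404, "Not Found", None, None)
unreachable = urllib.error.URLError("timed out")
print(describe_error(not_found))
print(describe_error(unreachable))
```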

  1.3 parse module (defines a standard interface for processing URLs, e.g. extracting each part of a URL, combining parts, and converting links).

  urlparse(urlstring, scheme='', allow_fragments=True): identifies and splits a URL of the form scheme://netloc/path;params?query#fragment. The part before :// is the scheme (the protocol); the part before the first / is the netloc (the domain name); then comes the path (the access path); after the semicolon come the params (parameters); after the question mark comes the query, generally used in GET-type URLs; after the # sign comes the fragment, an anchor used to jump directly to a position inside the page.

  urlunparse(parts): builds a URL from an iterable of length 6 (a list, a tuple, or a similar object).

  urlsplit(): similar to urlparse(), but does not split out params as a separate part; it stays in the path. The result has 5 parts.

  urlunsplit(parts): builds a URL from an iterable of length 5.

  urljoin(base_url, new_link): if the scheme, netloc, or path is missing from the new link, it is filled in from base_url; whatever exists in the new link is used as-is.

  urlencode(dict): serializes a dictionary of parameters into a GET query string.

  parse_qs(): deserializes a GET query string back into a dictionary.

  parse_qsl(): deserializes a GET query string into a list of tuples.

  quote(text): URL-encodes (percent-encodes) the text, for example Chinese characters.

  unquote(): URL-decodes.
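  The parse functions above can be tried without any network access. This is a small sketch; the URLs and the query values (name, age) are made-up examples.

```python
from urllib.parse import (
    parse_qs, parse_qsl, quote, unquote,
    urlencode, urljoin, urlparse, urlsplit, urlunparse,
)

url = "https://www.example.com/index.html;user?id=5#comment"

parsed = urlparse(url)   # 6 parts: params is separated from path
split = urlsplit(url)    # 5 parts: params stays inside path
print(parsed.path, parsed.params)   # /index.html user
print(split.path)                   # /index.html;user

rebuilt = urlunparse(parsed)        # back to the original string

base = "https://www.example.com/a/b.html"
relative = urljoin(base, "c.html")                          # filled in from base
absolute = urljoin(base, "https://other.example.org/x")     # new link wins

query = urlencode({"name": "germey", "age": 22})   # name=germey&age=22
as_dict = parse_qs(query)     # values come back as lists of strings
as_list = parse_qsl(query)    # list of (key, value) tuples

encoded = quote("壁纸")        # %E5%A3%81%E7%BA%B8
decoded = unquote(encoded)
```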

  1.4 robotparser module: the RobotFileParser class, urllib.robotparser.RobotFileParser(url=''). Commonly used methods: set_url(url) sets the link to the robots.txt file; read() fetches the robots.txt file and feeds it to the parser; parse(lines) parses the lines of a robots.txt file; can_fetch(useragent, url) returns a Boolean telling whether that user agent is allowed to crawl the URL; mtime() returns the time robots.txt was last fetched and analyzed; modified() sets the time robots.txt was last fetched and analyzed to the current time.
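  A minimal sketch of can_fetch(). To stay offline it passes the rules to parse() directly instead of calling set_url() and read(); the rules, the URLs, and the helper name make_parser are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Made-up robots.txt content; a real crawler would use set_url() + read().
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def make_parser(text):
    """Build a RobotFileParser from robots.txt text."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


rp = make_parser(ROBOTS_TXT)
blocked = rp.can_fetch("*", "https://www.example.com/private/page.html")
allowed = rp.can_fetch("*", "https://www.example.com/public/page.html")
print(blocked, allowed)   # False True
```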

2. isinstance() determines whether an object is of a known type, similar to type().

  Difference: type() does not treat a subclass instance as being of the parent class type; it ignores inheritance. isinstance() treats a subclass instance as being of the parent class type; it takes inheritance into account.

  Syntax: isinstance(object, classinfo). object: an instance. classinfo: a class (matched directly or through inheritance), a basic type, or a tuple of them.
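  A short sketch of the difference, using two made-up classes Animal and Dog:

```python
class Animal:
    pass


class Dog(Animal):
    pass


dog = Dog()

# type() compares the exact class and ignores inheritance.
exact_match = type(dog) is Animal       # False

# isinstance() follows the inheritance chain.
inherits = isinstance(dog, Animal)      # True

# classinfo may also be a tuple of types; any match is enough.
int_or_str = isinstance(3, (int, str))  # True

print(exact_match, inherits, int_or_str)
```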

3. socket module. A socket is an endpoint of a network connection. For example, when you request a home page in a Web browser, the browser creates a socket and tells it to connect to the Web server host, while the Web server listens on a socket for incoming requests. Both ends use their own socket to send and receive information. A socket is in effect a special kind of file, and the socket functions are operations on it (read/write I/O, open, close).
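  A minimal sketch of the two ends, assuming everything runs on the local machine (127.0.0.1) and the operating system picks a free port. The server here simply upper-cases what it receives; the function names echo_once and roundtrip are hypothetical.

```python
import socket
import threading


def echo_once(server_sock):
    """Server end: accept one connection and reply with upper-cased data."""
    conn, _addr = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())


def roundtrip(message):
    """Client end: connect, send a short message, return the reply."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))   # port 0: let the OS choose
        server.listen(1)
        port = server.getsockname()[1]

        worker = threading.Thread(target=echo_once, args=(server,))
        worker.start()

        with socket.create_connection(("127.0.0.1", port), timeout=5) as client:
            client.sendall(message)
            reply = client.recv(1024)

        worker.join()
    return reply


print(roundtrip(b"hello"))
```

  The with blocks close each socket when done, much like closing a file.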

Origin www.cnblogs.com/Turing-dz/p/11401910.html