urllib notes

In Python 3, the old urllib and urllib2 modules were merged into a single urllib package, containing urllib.request and urllib.error among others.
The urllib package is divided into urllib.request, urllib.parse, and urllib.error (plus urllib.robotparser).

HTTP request methods:
Under the HTTP standard, a request can use one of several request methods.
HTTP/1.0 defined three methods: GET, POST, and HEAD.
HTTP/1.1 added five new methods: OPTIONS, PUT, DELETE, TRACE, and CONNECT; PATCH was added later (RFC 5789).

1. GET: requests the specified page and returns the entity body (fetches data from the server).
2. HEAD: like GET, but the response contains no body; used to obtain only the headers.
3. POST: submits data to be processed to the specified resource (e.g., submitting a form or uploading a file). The data is contained in the request body. A POST request may create a new resource and/or modify an existing one (sends data to the server).
4. PUT: replaces the content of the specified document with data transmitted from the client.
5. DELETE: asks the server to delete the specified page.
6. CONNECT: reserved in HTTP/1.1 for proxies that can switch the connection into tunnel mode.
7. OPTIONS: lets the client query the capabilities of the server.
8. TRACE: echoes back the request received by the server; mainly used for testing or diagnostics.
9. PATCH: complements PUT; used to apply partial updates to a known resource.
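As a quick illustration, urllib.request.Request exposes the method a request will use via get_method(). This is only a sketch: the example.com URLs are placeholders and nothing is actually sent over the network.

```python
from urllib import request, parse

# Build Request objects without sending them (example.com is a placeholder).
get_req = request.Request('http://example.com/page')
print(get_req.get_method())   # GET is the default when there is no body

post_req = request.Request(
    'http://example.com/form',
    data=parse.urlencode({'name': 'tom'}).encode('utf-8'))
print(post_req.get_method())  # POST is the default when a body is supplied

head_req = request.Request('http://example.com/page', method='HEAD')
print(head_req.get_method())  # an explicit method= overrides the default
```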

URL details:
URI: Uniform Resource Identifier; identifies the resource used when establishing a connection and transmitting data.
URL: Uniform Resource Locator.
The general format of a URL is (parts in square brackets are optional): protocol://hostname[:port]/path/[;parameters][?query]#fragment
1. scheme (protocol): the transport protocol to use. The most common is HTTP, currently the most widely used protocol on the WWW.
   file: the resource is a file on the local computer. Format: file://
   ftp: the resource is accessed over FTP. Format: ftp://
   gopher: the resource is accessed over the Gopher protocol.
   http: the resource is accessed over HTTP. Format: http://
   https: the resource is accessed over secure HTTPS. Format: https://

2. host (host name / domain name): the Domain Name System (DNS) host name or IP address of the server storing the resource.
3. port (port number): an integer; optional, browsers default to port 80 for HTTP.
4. path: a string of zero or more segments separated by "/", generally indicating a directory or file address on the host.
5. parameters: used to specify special options.
6. query-string: optional; used to pass parameters to dynamic pages (e.g., pages produced with CGI, ISAPI, or PHP/JSP/ASP/ASP.NET). There may be multiple parameters, separated by the "&" symbol, with each parameter's name and value separated by the "=" symbol.
7. anchor (fragment): generally not processed by the backend; used for in-page navigation.
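The "&"/"=" structure of the query string can be split with urllib.parse.parse_qs; a sketch with a made-up query string:

```python
from urllib import parse

# A made-up query string: parameters separated by "&",
# each name and value separated by "=".
qs = 'name=tom&age=20&hobby=reading&hobby=music'
print(parse.parse_qs(qs))
# {'name': ['tom'], 'age': ['20'], 'hobby': ['reading', 'music']}
```

Note that every value comes back as a list, because a name may repeat (as hobby does here).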

Common request header parameters:

In the HTTP protocol, when a request is sent to the server the data is divided into three parts: the first is the data placed in the URL, the second is the data placed in the body (for POST requests), and the third is the data placed in the headers. Below are some request headers frequently used in crawlers:
1. User-Agent: the browser name. This is often used in crawling. When a web page is requested, the server can use this parameter to find out which browser sent the request. If we send the request from a crawler, the default User-Agent is Python's, so sites with anti-crawler mechanisms can easily reject the request. Therefore we should always set this value to that of a real browser to disguise our crawler.
2. Referer: indicates which URL the current request came from. This is also commonly used for anti-crawler purposes: if the request does not come from the expected page, the server may refuse to respond.
3. Cookie: the HTTP protocol is stateless, i.e., when the same person sends two requests, the server cannot tell whether they came from the same user. Cookies are used for identification: in general, if a site requires login before a page can be accessed, the cookie information needs to be sent along.
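A minimal sketch of attaching such headers to a request; the URL and header values are placeholders, and note that urllib normalizes stored header names to capitalized form (e.g. 'User-agent'):

```python
from urllib import request

# Placeholder URL and header values, for illustration only.
req = request.Request(
    'http://example.com/',
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Referer': 'http://example.com/index',
    })
# urllib stores header names capitalized internally, hence 'User-agent'.
print(req.get_header('User-agent'))
print(req.get_header('Referer'))
```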

An HTTP status code (HTTP Status Code) is a 3-digit code representing the state of the web server's HTTP response.
Status codes are defined by RFC 2616 and extended by specifications such as RFC 2518, RFC 2817, RFC 2295, RFC 2774, and RFC 4918.
1xx informational
2xx success
3xx redirection
4xx client error
5xx server error
(Common status codes: 200, 301, 302, 400, 403, 500)
Status codes beginning with 1:
These indicate a provisional response and require the requester to continue the operation.

100 (Continue): the requester should continue the request. The server returns this code to indicate that the first part of the request has been received and it is waiting for the rest.
101 (Switching Protocols): the requester has asked the server to switch protocols, and the server confirms the switch.

Status codes beginning with 2:
These indicate a successful request.

200 (OK): the request was processed successfully; returned under normal circumstances.
201 (Created): the request succeeded and the server created a new resource.
202 (Accepted): the request was accepted, but processing has not been completed.
203 (Non-Authoritative Information): the request was processed, but the returned information may come from another source.
204 (No Content): the server processed the request successfully but returns no content.
205 (Reset Content): like 204 the server returns no content, but the requester is required to reset the document view.
206 (Partial Content): the server processed part of the request (a range request).

3xx (redirection)
Redirection codes; these are also common.

300 (Multiple Choices): the server can perform several operations for the request. It may choose one according to the requester (user agent), or provide a list of actions for the requester to choose from.
301 (Moved Permanently): the requested page has been permanently moved to a new location. When the server returns this response (to a GET or HEAD request), it automatically forwards the requester to the new location.
302 (Found / Moved Temporarily): the server is currently responding to the request from a different location, but the requester should continue to use the original location for future requests.
303 (See Other): the server returns this code when the requester should issue a separate GET request to a different location to retrieve the response.
304 (Not Modified): the requested page has not been modified since the last request. When the server returns this response, it does not return the page content.
305 (Use Proxy): the requester can only access the requested page through a proxy. If the server returns this response, it also indicates which proxy to use.
307 (Temporary Redirect): the server is currently responding to the request from a different location, but the requester should continue to use the original location for future requests.

Status codes beginning with 4 indicate a request error.

400 (Bad Request): the server did not understand the syntax of the request.
401 (Unauthorized): the request requires authentication. For pages that require login, the server might return this response.
403 (Forbidden): the server refuses the request.
404 (Not Found): the server cannot find the requested page.
405 (Method Not Allowed): the method specified in the request is not allowed.
406 (Not Acceptable): the requested page cannot respond with the content characteristics requested.
407 (Proxy Authentication Required): similar to 401, but specifies that the requester must authenticate with the proxy.
408 (Request Timeout): the server timed out while waiting for the request.
409 (Conflict): the server encountered a conflict while completing the request. The server must include information about the conflict in the response.
410 (Gone): the server returns this response when the requested resource has been permanently removed.
411 (Length Required): the server does not accept the request without a valid Content-Length header field.
412 (Precondition Failed): the server did not satisfy one of the preconditions the requester placed on the request.
413 (Payload Too Large): the server cannot process the request because it is too large for the server to handle.
414 (URI Too Long): the requested URI (typically a URL) is too long for the server to process.
415 (Unsupported Media Type): the requested page does not support the format of the request.
416 (Range Not Satisfiable): the server returns this status code if the page cannot provide the requested range.
417 (Expectation Failed): the server does not meet the requirements of the "Expect" request header field.

Status codes beginning with 5 are less common, but we should know them.

500 (Internal Server Error): the server encountered an error and cannot fulfill the request.
501 (Not Implemented): the server does not have the functionality to complete the request. For example, the server might return this code when it does not recognize the request method.
502 (Bad Gateway): the server, acting as a gateway or proxy, received an invalid response from the upstream server.
503 (Service Unavailable): the server is currently unavailable (because it is overloaded or down for maintenance). Usually this is a temporary state.
504 (Gateway Timeout): the server, acting as a gateway or proxy, did not receive a response from the upstream server in time.
505 (HTTP Version Not Supported): the server does not support the HTTP protocol version used in the request.
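When urlopen hits a 4xx/5xx response it raises urllib.error.HTTPError, whose .code attribute carries the status. A small sketch (kept offline, no request is sent) of bucketing codes into the classes listed above; in practice you would call classify(e.code) inside an `except urllib.error.HTTPError as e:` block:

```python
# Bucket a status code into the classes listed above.
def classify(code):
    if 100 <= code < 200:
        return 'informational'
    if 200 <= code < 300:
        return 'success'
    if 300 <= code < 400:
        return 'redirection'
    if 400 <= code < 500:
        return 'client error'
    if 500 <= code < 600:
        return 'server error'
    return 'unknown'

for code in (200, 301, 302, 400, 403, 500):
    print(code, classify(code))
```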


Packet-capture tooling (browser developer tools):
Elements: the source code the current web page consists of.
Console: the output terminal for JS code.
Sources: the files that make up the page.
Network: the requests the page sends (inspect how a request is made and what content is requested).


Request libraries:
(jump into a built-in function's source in the IDE: Ctrl+B, or Ctrl + left mouse click)
urllib library: (built into Python 3 as a collection of modules under urllib; import it with: from urllib import request)
The urlopen function:
urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT, *, cafile=None, capath=None, cadefault=False, context=None)
url: the URL to request.
data: the request data; if this value is set, the request becomes a POST request.
timeout: the timeout in seconds.
Return value: an http.client.HTTPResponse object, handled like a file object:
read(size): reads size bytes; with no size specified, reads the entire content.
readline: reads one line.
readlines: reads multiple lines of data.
getcode: returns the status code.
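A minimal urlopen sketch; to keep it runnable offline it uses a data: URL (which urlopen has understood since Python 3.4) instead of a live web address:

```python
from urllib import request

# A data: URL stands in for a real http(s) address in this offline demo.
resp = request.urlopen('data:text/plain,hello%20urllib')
body = resp.read()           # read() with no size returns the whole body
print(body.decode('utf-8'))  # hello urllib
```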
The urlretrieve function:
This method conveniently saves a web page, a picture, etc. to a local file.
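A sketch of urlretrieve; the data: URL below stands in for a real http(s) address (e.g., an image URL) so the example works offline, and the target filename is arbitrary:

```python
import os
import tempfile
from urllib import request

# Download to a temporary file; normally the first argument would be an
# http(s) URL and the second a local filename such as 'page.html'.
target = os.path.join(tempfile.gettempdir(), 'urlretrieve_demo.txt')
filename, headers = request.urlretrieve(
    'data:text/plain,saved%20by%20urlretrieve', target)
with open(filename, 'rb') as fp:
    print(fp.read().decode('utf-8'))  # saved by urlretrieve
```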
parse_qs(qs) parses a query string back into a dict. urlparse and urlsplit are basically the same, except urlparse returns one extra attribute, params.

from urllib import parse

url = 'http://www.imailtone.com/WebApplication1/WebForm1.aspx;hello?name=tom&age=20#resume'
result = parse.urlparse(url)
print('scheme:', result.scheme)
print('netloc:', result.netloc)
print('path:', result.path)
print('params:', result.params)
print('query:', result.query)
print('fragment:', result.fragment)

request.Request class:
If you want to add request headers to a request, you need to use the request.Request class, for example to add a User-Agent:

from urllib import request, parse

url = 'https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromSearch=true&labelWords=&suginput='
# resp = request.urlopen(url)
# print(resp.read())
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.86 Safari/537.36',
    'Referer': 'https://www.lagou.com/jobs/list_python/p-city_0?&cl=false&fromSearch=true&labelWords=&suginput=',
}
data = {
    'first': 'true',
    'pn': '1',
    'kd': 'python',
}
# add the User-Agent and Referer headers, pass the form data, and encode it
req = request.Request(url, headers=headers, data=parse.urlencode(data).encode('utf-8'), method='POST')
resp = request.urlopen(req)
print(resp.read().decode('utf-8'))  # decode the response body

ProxyHandler (setting an IP proxy):
Use the site 'http://www.httpbin.org/' to inspect some parameters of an HTTP request.
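A sketch of the usual ProxyHandler pattern; '127.0.0.1:8888' is a placeholder proxy address, so the actual network call is left commented out:

```python
from urllib import request

# '127.0.0.1:8888' is a placeholder; substitute a real proxy's ip:port.
handler = request.ProxyHandler({'http': 'http://127.0.0.1:8888'})
opener = request.build_opener(handler)
# resp = opener.open('http://www.httpbin.org/ip')  # httpbin echoes the caller's IP
# print(resp.read())
```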
What is a cookie: a cookie (sometimes used in the plural form, cookies) is data (usually encrypted) that certain websites store on the user's local machine in order to identify the user and track a session. It was originally defined in RFC 2109 and RFC 2965, both now obsolete; the current specification is RFC 6265.
Name: the name of the cookie.
Value: the value of the cookie.
Domain: the domain that can access the cookie.
Expires: the cookie's expiration time, in seconds.
Path: the path for which the cookie is valid.
Secure: whether the cookie is sent only over a secure protocol. Secure protocols such as HTTPS and SSL encrypt data before transmitting it over the network. Defaults to False.

Example 1:
from urllib import request
url = 'http://www.renren.com/880151247/profile'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:73.0) Gecko/20100101 Firefox/73.0',
    'Cookie': 'wp_fold=0; _r01_=1; anonymid=k6q3nelcabojoo; depovince=LN; taihe_bi_sdk_uid=09ecf7537e2a34b628e281831b84b42d; jebe_key=2c043046-81f7-4d1f-967b-311f5ced93f4%7Cd39a42f8e30da355ea6608fb51f73574%7C1581922246738%7C1%7C1581922246672; jebe_key=2c043046-81f7-4d1f-967b-311f5ced93f4%7Cd39a42f8e30da355ea6608fb51f73574%7C1581922246738%7C1%7C1581922246674; jebecookies=c773ae43-1e0b-49e7-968f-09395211d019|||||; JSESSIONID=abcfYtpckGeCT5Oaw-rbx; taihe_bi_sdk_session=a92b80af092a72dea469264ef24c32a1; ick_login=42131da7-55ae-47c7-a142-740f0df95f89; t=abc15a448e816609aad40fdb911941d27; societyguester=abc15a448e816609aad40fdb911941d27; id=973744147; xnsid=adfefda2; ver=7.0; loginfrom=null',
}
req = request.Request(url, headers=headers)
resp = request.urlopen(req)
# print(resp.read().decode('utf-8'))
with open('dapeng.html', 'w', encoding='utf-8') as fp:
    # write() must be given str data
    # resp.read() returns bytes
    # str -> encode -> bytes
    # bytes -> decode -> str
    fp.write(resp.read().decode('utf-8'))
Example 2: log in with a username and password


The http.cookiejar module:
CookieJar (stored in memory), FileCookieJar, MozillaCookieJar, LWPCookieJar (stored in a file).
CookieJar: manages HTTP cookie values, stores cookies generated by HTTP requests, and adds cookies to outgoing HTTP requests. The cookies are stored entirely in memory; after a CookieJar instance is garbage-collected, its cookies are lost.
FileCookieJar(filename, delayload=None, policy=None): derived from CookieJar; creates a FileCookieJar instance, retrieves cookie information, and stores the cookies in a file. filename is the name of the file the cookies are stored in. delayload=True enables lazy file access, i.e., the file is read and data is written only when needed.
MozillaCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar; creates a FileCookieJar instance compatible with the Mozilla browser's cookies.txt format.
LWPCookieJar(filename, delayload=None, policy=None): derived from FileCookieJar; creates a FileCookieJar instance compatible with the libwww-perl standard Set-Cookie3 file format.
In fact, in most cases we only use CookieJar(); if we need to interact with local files, we use MozillaCookieJar() or LWPCookieJar(). Of course, if we need customized cookie handling, we do it with the help of the HTTPCookieProcessor handler. See the code below.
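A sketch of the common in-memory pattern with CookieJar and HTTPCookieProcessor; the URL is just an example, so the network call is left commented out:

```python
from http.cookiejar import CookieJar
from urllib import request

cookiejar = CookieJar()                           # cookies live in memory only
handler = request.HTTPCookieProcessor(cookiejar)  # attaches/collects cookies
opener = request.build_opener(handler)
# resp = opener.open('http://www.baidu.com')
# for cookie in cookiejar:                        # cookies set by the response
#     print(cookie.name, cookie.value)
```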

Save cookies to a local file:
from urllib import request
from http.cookiejar import MozillaCookieJar

cookiejar = MozillaCookieJar('cookie1.txt')
cookiejar.load(ignore_discard=True)  # load previously saved cookie information
handler = request.HTTPCookieProcessor(cookiejar)
opener = request.build_opener(handler)
# resp = opener.open('http://www.baidu.com')
resp = opener.open('http://www.httpbin.org/cookies/set?count=spider')
# cookiejar.save()  # if no filename was given when creating the cookiejar, pass one here
# cookiejar.save(ignore_discard=True)  # also save cookies that are discarded at session end
for cookie in cookiejar:
    print(cookie)

Origin www.cnblogs.com/baoshijie/p/12363965.html