Article Updated: 2020-03-19
Note: This article refers to the official documentation to explain urllib.
Article Directory
- First, introduction to the urllib modules
- 1, urllib.request module
- 2, urllib.error module
- 3, urllib.parse module
- 4, urllib.robotparser module
- Second, using the module
- 1, get web content (method 1)
- 2, get web content (method 2)
- 3, using basic HTTP authentication
- 4, add headers
- 5, using parameters
- 6, using a proxy
- 7, response codes
- 8, handle exceptions
- Third, the summary
- 1, urllib.request.urlopen()
- 2, add form data
- 3, get response information
- 4, save the data
- 5, handle exceptions
- Fourth, Enjoy!
First, introduction to the urllib modules
urllib is a package that groups the following four modules for working with URLs.
1, urllib.request module
Note 1: Request objects have the following attributes: full_url, type, host, origin_req_host, selector, data, unverifiable, method.
Note 2: Request objects have the following methods: get_method(), add_header(key, val), add_unredirected_header(key, header), has_header(header), remove_header(header), get_full_url(), set_proxy(host, type), get_header(header_name, default=None), header_items().
Note 3: The attributes and methods above can all be used on a Request object.
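The attributes and methods listed above can be tried without any network access, because a Request object only describes a request and does not send it. The following is a minimal sketch; the example.com URL and the Referer value are placeholders chosen for illustration.

```python
import urllib.request

# Building a Request does not contact the server.
req = urllib.request.Request("http://www.example.com/path?q=1")
req.add_header("Referer", "http://www.python.org/")

print(req.full_url)       # http://www.example.com/path?q=1
print(req.type)           # http
print(req.host)           # www.example.com
print(req.selector)       # /path?q=1
print(req.get_method())   # GET (no data was supplied)
print(req.header_items()) # [('Referer', 'http://www.python.org/')]

# Headers can be queried and removed again.
had_referer = req.has_header("Referer")
req.remove_header("Referer")
has_referer_now = req.has_header("Referer")
print(had_referer, has_referer_now)  # True False
```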
(1) The urlopen function
urllib.request.urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
*, cafile=None, capath=None, cadefault=False, context=None)
Note 1: Opens a URL. The url argument can be either a URL string or a Request object.
Note 2: It returns an http.client.HTTPResponse or urllib.response.addinfourl object, which has the following methods:
Note 3: geturl() method: returns the URL of the resource that was actually retrieved, which is commonly used to determine whether a redirect was followed.
Note 4: info() method: returns the meta-information of the response, such as the headers.
Note 5: getcode() method: returns the HTTP response code.
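The three methods above can be sketched against a throwaway local server, so the example runs without Internet access. The DemoHandler class, the /old and /new paths, and the 127.0.0.1 address are assumptions made for this illustration only; they are not part of urllib.

```python
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer


class DemoHandler(BaseHTTPRequestHandler):
    """Hypothetical handler: /old redirects to /new, anything else returns a short body."""

    def do_GET(self):
        if self.path == "/old":
            self.send_response(302)
            self.send_header("Location", "/new")
            self.send_header("Content-Length", "0")
            self.end_headers()
        else:
            body = b"hello"
            self.send_response(200)
            self.send_header("Content-Type", "text/plain; charset=utf-8")
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep the demo output quiet


# Port 0 lets the operating system pick a free port.
server = HTTPServer(("127.0.0.1", 0), DemoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
base = "http://127.0.0.1:%d" % server.server_port

with urllib.request.urlopen(base + "/old", timeout=5) as resp:
    final_url = resp.geturl()                   # URL after the redirect was followed
    status = resp.getcode()                     # HTTP status of the final response
    content_type = resp.info()["Content-Type"]  # a header from the meta-information
    body = resp.read()

server.shutdown()
server.server_close()

print(final_url.endswith("/new"))
print(status)
print(content_type)
print(body)
```

Comparing geturl() with the URL that was requested is a simple way to detect that a redirect happened.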
(2) The Request class
urllib.request.Request(url, data=None, headers={}, origin_req_host=None,
unverifiable=False, method=None)
Note 1: url should be a string containing a valid URL.
Note 2: data is the object holding the data to be sent to the server, or None if there is no data.
Note 3: For a POST request, data should be bytes in the application/x-www-form-urlencoded format.
Note 4: headers should be a dictionary.
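The notes above can be illustrated with a short sketch that builds Request objects without sending them. The form fields and the example.com URL are hypothetical values used only for illustration.

```python
import urllib.parse
import urllib.request

# Hypothetical form fields, encoded as application/x-www-form-urlencoded bytes.
form = {"name": "klem", "lang": "python"}
data = urllib.parse.urlencode(form).encode("ascii")

req = urllib.request.Request(
    "http://www.example.com/submit",          # placeholder URL
    data=data,
    headers={"User-Agent": "urllib-example/0.1"},
)

print(data)              # b'name=klem&lang=python'
print(req.get_method())  # POST, because data was supplied

# Header names are stored capitalized ("User-agent"), and get_header()
# looks the name up exactly as given.
print(req.get_header("User-agent"))

# The method argument overrides the default choice of GET or POST.
put_req = urllib.request.Request("http://www.example.com/submit", data=data, method="PUT")
print(put_req.get_method())  # PUT
```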
(3) The ProxyHandler(proxies=None) class
Note 1: It is used to configure a proxy.
Note 2: In a proxy URL, the port number (:port) is optional.
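A minimal sketch of how a ProxyHandler is usually combined with build_opener(). The proxy address proxy.example.com:8080 is a placeholder, so the request itself is left commented out.

```python
import urllib.request

# Map each URL scheme to the proxy that should handle it (placeholder address).
proxies = {"http": "http://proxy.example.com:8080/"}
proxy_handler = urllib.request.ProxyHandler(proxies)

# The opener routes matching requests through the configured proxy.
opener = urllib.request.build_opener(proxy_handler)

print(proxy_handler.proxies)
print(proxy_handler in opener.handlers)  # True

# With a real proxy, the request would be sent like this:
# with opener.open("http://www.python.org/") as f:
#     print(f.read(100))
```

Passing an empty dictionary, ProxyHandler({}), turns proxy use off for that opener.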
2, urllib.error module
This module defines the exceptions URLError, HTTPError, and ContentTooShortError(msg, content).
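HTTPError is a subclass of URLError, so code that handles both has to check for HTTPError first. The sketch below shows this without network access: the HTTPError is built by hand to stand in for a real 404 response, and the URLError comes from asking urlopen for an unsupported scheme. The describe_failure helper and both URLs are hypothetical names introduced for this illustration.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def describe_failure(exc):
    """Return a short description; HTTPError is tested first because it subclasses URLError."""
    if isinstance(exc, HTTPError):
        return "HTTP error %d: %s" % (exc.code, exc.reason)
    if isinstance(exc, URLError):
        return "URL error: %s" % exc.reason
    return "other error: %r" % (exc,)


# Hand-built HTTPError standing in for a real "404 Not Found" response.
http_err = HTTPError("http://www.example.com/missing", 404, "Not Found", None, None)

# An unsupported scheme makes urlopen raise URLError before any connection is made.
try:
    urlopen("foo://example")
except URLError as e:
    url_err = e

print(issubclass(HTTPError, URLError))  # True
print(describe_failure(http_err))       # HTTP error 404: Not Found
print(describe_failure(url_err))
```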
3, urllib.parse module
This module is used to process URLs: it can split a URL into its components and join components back into a URL.
In general, a URL can be divided into six parts, for example: scheme://netloc/path;parameters?query#fragment
Each part is a string, and some may be empty. These six parts are the smallest units and are not divided any further.
The six parts are available as the attributes scheme, netloc, path, params, query, and fragment of the result returned by urllib.parse.urlparse().
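The six-part split described above can be sketched with urlparse(), with urlunparse() and urljoin() covering the joining direction. The URL is a made-up example chosen so that every part is non-empty.

```python
from urllib.parse import urlparse, urlunparse, urljoin

url = "http://www.example.com/a/b;type=x?spam=1&eggs=2#top"
parts = urlparse(url)

print(parts.scheme)    # http
print(parts.netloc)    # www.example.com
print(parts.path)      # /a/b
print(parts.params)    # type=x
print(parts.query)     # spam=1&eggs=2
print(parts.fragment)  # top

# The six parts can be joined back into the original URL.
rebuilt = urlunparse(parts)
print(rebuilt == url)  # True

# urljoin resolves a relative reference against a base URL.
joined = urljoin("http://www.example.com/a/b", "c")
print(joined)          # http://www.example.com/a/c
```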
4, urllib.robotparser module
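This module parses a site's robots.txt file and answers whether a given user agent may fetch a URL. The following is a minimal sketch that feeds hand-written rules to parse() instead of downloading a real robots.txt; the rules and the example.com URLs are assumptions made for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hand-written rules standing in for the contents of a real robots.txt file.
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

allowed_index = rp.can_fetch("*", "http://www.example.com/index.html")
allowed_private = rp.can_fetch("*", "http://www.example.com/private/data.html")

print(allowed_index)    # True
print(allowed_private)  # False
```

For a live site, set_url() followed by read() downloads and parses the robots.txt file before can_fetch() is called.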
Second, using the module
1, get web content (method 1)
>>> import urllib.request
>>> with urllib.request.urlopen('http://www.python.org/') as f:
...     print(f.read(300))
...
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n\n<head>\n
<meta http-equiv="content-type" content="text/html; charset=utf-8" />\n
<title>Python Programming '
Note 1: If the page contains Chinese text, change f.read(300) to f.read(300).decode('utf-8') so that it displays correctly.
Note 2: urlopen returns a bytes object and does not decode it automatically, so Chinese text must be decoded manually before it can be displayed properly.
import urllib.request
url = "http://www.baidu.com/"
response = urllib.request.urlopen(url)
print(response.read().decode('utf-8'))
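Note 1: The response body can only be read once (a single print of read() counts as that read); reading it again returns empty bytes.
Note 2: If you print only response, without calling read(), the output is the repr of the response object (an http.client.HTTPResponse object and its memory address), not the page content.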
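Both points, manual decoding and the body being readable only once, can be checked offline. The sketch below uses a data: URL as a stand-in for a real web page, which is an assumption made so that no network is needed; the percent-encoded payload is the UTF-8 encoding of the Chinese word for "hello".

```python
import urllib.request

# A data: URL carries its own content, so urlopen needs no server here.
url = "data:text/plain;charset=utf-8,%E4%BD%A0%E5%A5%BD"

with urllib.request.urlopen(url) as response:
    raw = response.read()    # first read returns the whole body as bytes
    again = response.read()  # second read finds nothing left

text = raw.decode("utf-8")   # bytes must be decoded explicitly

print(raw)    # b'\xe4\xbd\xa0\xe5\xa5\xbd'
print(text)   # 你好
print(again)  # b''
```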
2, get web content (method 2)
Note: You can also use the following method.
>>> import urllib.request
>>> f = urllib.request.urlopen('http://www.python.org/')
>>> print(f.read(100).decode('utf-8'))
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtm
3, using basic HTTP authentication
import urllib.request
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='klem',
                          passwd='kadidd!ehopper')
opener = urllib.request.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib.request.install_opener(opener)
urllib.request.urlopen('http://www.example.com/login.html')
4, add headers
import urllib.request
req = urllib.request.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
# Customize the default User-Agent header value:
req.add_header('User-Agent', 'urllib-example/0.1 (Contact: . . .)')
r = urllib.request.urlopen(req)
5, using parameters
>>> import urllib.request
>>> import urllib.parse
>>> data = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> data = data.encode('ascii')
>>> with urllib.request.urlopen("http://requestb.in/xrbl82xr", data) as f:
...     print(f.read().decode('utf-8'))
...
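The same encoded parameters can travel either in the query string of a GET request or in the body of a POST request. The following sketch shows the difference without sending anything; the example.com URL is a placeholder.

```python
import urllib.parse
import urllib.request

params = {'spam': 1, 'eggs': 2, 'bacon': 0}
query = urllib.parse.urlencode(params)
print(query)  # spam=1&eggs=2&bacon=0

# GET: the parameters are appended to the URL.
get_req = urllib.request.Request("http://www.example.com/search?" + query)
print(get_req.get_method(), get_req.full_url)

# POST: the parameters are sent as bytes in the request body.
post_req = urllib.request.Request("http://www.example.com/search",
                                  data=query.encode('ascii'))
print(post_req.get_method(), post_req.data)

# parse_qs reverses urlencode; every value comes back as a list of strings.
decoded = urllib.parse.parse_qs(query)
print(decoded)  # {'spam': ['1'], 'eggs': ['2'], 'bacon': ['0']}
```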
6, using a proxy
>>> import urllib.request
>>> proxies = {'http': 'http://proxy.example.com:8080/'}
>>> opener = urllib.request.FancyURLopener(proxies)
>>> with opener.open("http://www.python.org") as f:
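...     f.read().decode('utf-8')
...
Note: FancyURLopener has been deprecated since Python 3.3; urllib.request.ProxyHandler combined with build_opener() is the non-deprecated way to configure a proxy.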
7, response codes
Response Code | Explanation |
---|---|
100 | (‘Continue’, ‘Request received, please continue’) |
101 | (‘Switching Protocols’, ‘Switching to new protocol; obey Upgrade header’) |
200 | (‘OK’, ‘Request fulfilled, document follows’) |
201 | (‘Created’, ‘Document created, URL follows’) |
202 | (‘Accepted’, ‘Request accepted, processing continues off-line’) |
203 | (‘Non-Authoritative Information’, ‘Request fulfilled from cache’) |
204 | (‘No Content’, ‘Request fulfilled, nothing follows’) |
205 | (‘Reset Content’, ‘Clear input form for further input.’) |
206 | (‘Partial Content’, ‘Partial content follows.’) |
300 | (‘Multiple Choices’, ‘Object has several resources – see URI list’) |
301 | (‘Moved Permanently’, ‘Object moved permanently – see URI list’) |
302 | (‘Found’, ‘Object moved temporarily – see URI list’) |
303 | (‘See Other’, ‘Object moved – see Method and URL list’) |
304 | (‘Not Modified’, ‘Document has not changed since given time’) |
305 | (‘Use Proxy’, ‘You must use proxy specified in Location to access this resource.’) |
307 | (‘Temporary Redirect’, ‘Object moved temporarily – see URI list’) |
400 | (‘Bad Request’, ‘Bad request syntax or unsupported method’) |
401 | (‘Unauthorized’, ‘No permission – see authorization schemes’) |
402 | (‘Payment Required’, ‘No payment – see charging schemes’) |
403 | (‘Forbidden’, ‘Request forbidden – authorization will not help’) |
404 | (‘Not Found’, ‘Nothing matches the given URI’) |
405 | (‘Method Not Allowed’, ‘Specified method is invalid for this server.’) |
406 | (‘Not Acceptable’, ‘URI not available in preferred format.’) |
407 | (‘Proxy Authentication Required’, ‘You must authenticate with this proxy before proceeding.’) |
408 | (‘Request Timeout’, ‘Request timed out; try again later.’) |
409 | (‘Conflict’, ‘Request conflict.’) |
410 | (‘Gone’, ‘URI no longer exists and has been permanently removed.’) |
411 | (‘Length Required’, ‘Client must specify Content-Length.’) |
412 | (‘Precondition Failed’, ‘Precondition in headers is false.’) |
413 | (‘Request Entity Too Large’, ‘Entity is too large.’) |
414 | (‘Request-URI Too Long’, ‘URI is too long.’) |
415 | (‘Unsupported Media Type’, ‘Entity body in unsupported format.’) |
416 | (‘Requested Range Not Satisfiable’, ‘Cannot satisfy request range.’) |
417 | (‘Expectation Failed’, ‘Expect condition could not be satisfied.’) |
500 | (‘Internal Server Error’, ‘Server got itself in trouble’) |
501 | (‘Not Implemented’, ‘Server does not support this operation’) |
502 | (‘Bad Gateway’, ‘Invalid responses from another server/proxy.’) |
503 | (‘Service Unavailable’, ‘The server cannot process the request due to a high load’) |
504 | (‘Gateway Timeout’, ‘The gateway server did not receive a timely response’) |
505 | (‘HTTP Version Not Supported’, ‘Cannot fulfill request.’) |
8, handle exceptions
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

someurl = 'http://www.example.com/'  # placeholder; replace with the URL to fetch
req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError as e:
    # HTTPError must come first because it is a subclass of URLError.
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    # everything is fine
    pass
9, handle exceptions 2
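from urllib.request import Request, urlopen
from urllib.error import URLError

someurl = 'http://www.example.com/'  # placeholder; replace with the URL to fetch
req = Request(someurl)
try:
    response = urlopen(req)
except URLError as e:
    # Check 'code' first: HTTPError also exposes 'reason', so testing
    # 'reason' first would hide the HTTP status code.
    if hasattr(e, 'code'):
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
    elif hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
else:
    # everything is fine
    pass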
Third, the summary
1, urllib.request.urlopen()
import urllib.request
url = "http://www.csdn.net/"
# Build a Request object first; headers cannot be added to the result of urlopen().
req = urllib.request.Request(url)
req.add_header('Referer', 'http://www.csdn.net')
req.add_header('User-Agent', 'chrome')
response = urllib.request.urlopen(req)
print(response.read(2000).decode('utf-8'))
2, add form data
import urllib.parse
import urllib.request
url = "http://httpbin.org/post"
data = bytes(urllib.parse.urlencode({'hello':'world'}),encoding='utf-8')
response = urllib.request.urlopen(url,data=data)
print(response.read().decode('utf-8'))
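3, get response information
import urllib.request
url = "http://httpbin.org/"
response = urllib.request.urlopen(url)
# Print the values; outside the interactive shell a bare call shows nothing.
print(response.info())     # response headers
print(response.getcode())  # HTTP status code
print(response.geturl())   # URL that was actually retrieved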
4, save the data
To save an HTML page:
import urllib.request
url = "http://httpbin.org/"
response = urllib.request.urlopen(url)
data = response.read()
with open("file.html", "wb") as file:
    file.write(data)
To save an image or other binary data:
import urllib.request
url = "http://csdn.net/favicon.ico"
response = urllib.request.urlopen(url)
data = response.read()
# Keep the extension consistent with the downloaded file type (.ico here).
with open("file.ico", "wb") as file:
    file.write(data)
Another method (note that the target directory must already exist):
import urllib.request
url = "http://csdn.net/"
urllib.request.urlretrieve(url, filename=r"d:\test\index.html")
The return value of this call is a tuple (filename, headers); assign it to a variable if you want to print it.
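The return value of urlretrieve() can be inspected with a short offline sketch. A data: URL stands in for a real page and a temporary directory replaces d:\test; both are assumptions made so the example runs anywhere.

```python
import os
import tempfile
import urllib.request

# A data: URL carries its own content, so no network access is needed.
url = "data:text/plain;charset=utf-8,hello%20urllib"

with tempfile.TemporaryDirectory() as tmp_dir:
    target = os.path.join(tmp_dir, "index.html")
    # urlretrieve returns a (filename, headers) tuple.
    saved_path, headers = urllib.request.urlretrieve(url, filename=target)
    with open(saved_path, "rb") as fh:
        saved_bytes = fh.read()

print(saved_path == target)     # True
print(headers["Content-Type"])  # text/plain;charset=utf-8
print(saved_bytes)              # b'hello urllib'
```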
5, handle exceptions
>>> import urllib.request
>>> for i in range(10):
...     try:
...         file = urllib.request.urlopen("http://zhihu.com", timeout=0.2)
...         data = file.read()
...         print(len(data))
...     except Exception as e:
...         print("Exception occurred: " + str(e))
...