Python crawler practice: urllib

Article Updated: 2020-03-19
Note: This article explains urllib with reference to the official documentation.

I. Introduction to the urllib modules

urllib handles URLs mainly through the following four modules.

1. The urllib.request module

Note 1: A Request object has the following attributes: full_url, type, host, origin_req_host, selector, data, unverifiable, method.
Note 2: A Request object has the following methods: get_method(), add_header(key, val), add_unredirected_header(key, header), has_header(header), remove_header(header), get_full_url(), set_proxy(host, type), get_header(header_name, default=None), header_items().
Note 3: All of the attributes and methods above can be accessed on a constructed Request object.
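As a quick check of these attributes and methods, a Request object can be constructed and inspected without making any network call (the URL below is only a placeholder):

```python
import urllib.request

# Building a Request makes no connection; that happens only in urlopen().
req = urllib.request.Request('http://www.example.com/path?q=1',
                             data=b'spam=1', method='POST')

print(req.full_url)      # 'http://www.example.com/path?q=1'
print(req.type)          # 'http'
print(req.host)          # 'www.example.com'
print(req.get_method())  # 'POST'

req.add_header('User-Agent', 'demo/0.1')
# Header names are stored capitalized internally ('User-agent'):
print(req.has_header('User-agent'))  # True
print(req.header_items())
```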

(1) The urlopen function

urllib.request.urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
                       *, cafile=None, capath=None, cadefault=False, context=None)

Note 1: Used to open a URL; url may be a URL string or a Request object.
Note 2: It returns an http.client.HTTPResponse or urllib.response.addinfourl object, which has the following methods:
Note 3: geturl(): returns the URL of the resource actually retrieved, commonly used to determine whether a redirect occurred.
Note 4: info(): returns the page's meta-information, such as its headers.
Note 5: getcode(): returns the HTTP status code of the response.

(2) The Request class

urllib.request.Request(url, data=None, headers={}, origin_req_host=None,
                       unverifiable=False, method=None)

Note 1: url should be a valid URL string.
Note 2: data is the data object to be sent to the server.
Note 3: For a POST request, data should be in application/x-www-form-urlencoded format.
Note 4: headers should be a dictionary.

(3) The ProxyHandler(proxies=None) class

Note 1: Used to set up a proxy.
Note 2: The port number (:port) in the proxy URL is optional.
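A minimal sketch of wiring a ProxyHandler into an opener (the proxy address is a placeholder; building and installing the opener makes no network request):

```python
import urllib.request

# Map URL schemes to proxy URLs; the port after the colon is optional.
proxies = {'http': 'http://proxy.example.com:8080'}
proxy_handler = urllib.request.ProxyHandler(proxies)

opener = urllib.request.build_opener(proxy_handler)
# Install globally so a plain urllib.request.urlopen() call uses the proxy:
urllib.request.install_opener(opener)
```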

2. The urllib.error module

This module defines URLError, HTTPError, and ContentTooShortError(msg, content).

3. The urllib.parse module

This module is used to process URLs: it can split a URL into its components and join components back into a URL.
In general, a URL can be divided into six parts, for example: scheme://netloc/path;parameters?query#fragment
Each part is a string, and some may be empty; these six components are the minimum units and cannot be split further.

(Figures omitted; the six components are explained below.)

scheme: the URL scheme (protocol), e.g. http or https
netloc: the network location, i.e. host name and optional port
path: the hierarchical path to the resource
parameters: parameters for the last path element
query: the query component (key=value pairs)
fragment: the fragment identifier within the resource
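The split into these six parts can be seen directly with urlparse, and urlunparse reverses it (the URL below is just an illustration):

```python
from urllib.parse import urlparse, urlunparse

parts = urlparse('http://www.example.com:80/path;params?query=1#frag')
print(parts.scheme)    # 'http'
print(parts.netloc)    # 'www.example.com:80'
print(parts.path)      # '/path'
print(parts.params)    # 'params'
print(parts.query)     # 'query=1'
print(parts.fragment)  # 'frag'

# urlunparse reassembles the six parts into the original URL:
print(urlunparse(parts))
```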

4. The urllib.robotparser module

This module provides a single class, RobotFileParser, which reads robots.txt files and answers whether a given user agent is allowed to fetch a given URL.
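A minimal sketch of RobotFileParser: the rules can be fed in directly with parse(), so no network access is needed (the robots.txt lines below are made up for illustration; read() would instead fetch them from the URL given to set_url()):

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
# parse() takes the robots.txt content as an iterable of lines:
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])

print(rp.can_fetch('*', 'http://www.example.com/index.html'))  # True
print(rp.can_fetch('*', 'http://www.example.com/private/x'))   # False
```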

II. Using the modules

1. Fetching a web page, method 1

>>> import urllib.request
>>> with urllib.request.urlopen('http://www.python.org/') as f:
...     print(f.read(300))
...
b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\n\n\n<html
xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">\n\n<head>\n
<meta http-equiv="content-type" content="text/html; charset=utf-8" />\n
<title>Python Programming '

Note 1: To output Chinese (or other non-ASCII) text, f.read(300) should be changed to f.read(300).decode('utf-8').
Note 2: urlopen returns a bytes object and does not transcode automatically, so non-ASCII content must be decoded manually to display properly.
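The point in Note 2 can be reproduced without any network call: read() hands back bytes, and decode() is what turns them into text (the sample string below is just an illustration):

```python
# read() on a urlopen() response returns a bytes object like this one:
raw = '你好, urllib'.encode('utf-8')
print(type(raw))            # <class 'bytes'>
print(raw.decode('utf-8'))  # 你好, urllib
```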

import urllib.request
url = "http://www.baidu.com/"
response = urllib.request.urlopen(url)
print(response.read().decode('utf-8'))

Note 1: The response can only be read once (a print of read() counts as a read); reading it again returns an empty result.
Note 2: If you output the response object itself without calling read(), the output looks like this:

(Figure: printing the response object shows only its repr, e.g. <http.client.HTTPResponse object at 0x...>)

2. Fetching a web page, method 2

Note: Of course, you can also use the following approach.

>>> import urllib.request
>>> f = urllib.request.urlopen('http://www.python.org/')
>>> print(f.read(100).decode('utf-8'))
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtm

3. Using basic HTTP authentication

import urllib.request
# Create an OpenerDirector with support for Basic HTTP Authentication...
auth_handler = urllib.request.HTTPBasicAuthHandler()
auth_handler.add_password(realm='PDQ Application',
                          uri='https://mahler:8092/site-updates.py',
                          user='klem',
                          passwd='kadidd!ehopper')
opener = urllib.request.build_opener(auth_handler)
# ...and install it globally so it can be used with urlopen.
urllib.request.install_opener(opener)
urllib.request.urlopen('http://www.example.com/login.html')

4. Adding headers

import urllib.request
req = urllib.request.Request('http://www.example.com/')
req.add_header('Referer', 'http://www.python.org/')
# Customize the default User-Agent header value:
req.add_header('User-Agent', 'urllib-example/0.1 (Contact: . . .)')
r = urllib.request.urlopen(req)

5. Sending parameters

>>> import urllib.request
>>> import urllib.parse
>>> data = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
>>> data = data.encode('ascii')
>>> with urllib.request.urlopen("http://requestb.in/xrbl82xr", data) as f:
...     print(f.read().decode('utf-8'))
...
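The example above sends the encoded parameters as POST data. For a GET request, the same urlencode() output is instead appended to the URL after a question mark; building the URL itself needs no network access (the host below is a placeholder):

```python
import urllib.parse

# Encode the parameters and attach them to the URL as a query string:
params = urllib.parse.urlencode({'spam': 1, 'eggs': 2, 'bacon': 0})
url = 'http://www.example.com/cgi?' + params
print(url)  # http://www.example.com/cgi?spam=1&eggs=2&bacon=0
```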

6. Using a proxy

Note: FancyURLopener has been deprecated since Python 3.3; ProxyHandler with build_opener (see section I.1.(3)) is the preferred approach today.

>>> import urllib.request
>>> proxies = {'http': 'http://proxy.example.com:8080/'}
>>> opener = urllib.request.FancyURLopener(proxies)
>>> with opener.open("http://www.python.org") as f:
...     f.read().decode('utf-8')
...

7. Response codes

Each entry below is: code ('Reason phrase', 'Explanation').
100 (‘Continue’, ‘Request received, please continue’)
101 (‘Switching Protocols’, ‘Switching to new protocol; obey Upgrade header’)
200 (‘OK’, ‘Request fulfilled, document follows’)
201 (‘Created’, ‘Document created, URL follows’)
202 (‘Accepted’, ‘Request accepted, processing continues off-line’)
203 (‘Non-Authoritative Information’, ‘Request fulfilled from cache’)
204 (‘No Content’, ‘Request fulfilled, nothing follows’)
205 (‘Reset Content’, ‘Clear input form for further input.’)
206 (‘Partial Content’, ‘Partial content follows.’)
300 (‘Multiple Choices’, ‘Object has several resources – see URI list’)
301 (‘Moved Permanently’, ‘Object moved permanently – see URI list’)
302 (‘Found’, ‘Object moved temporarily – see URI list’)
303 (‘See Other’, ‘Object moved – see Method and URL list’)
304 (‘Not Modified’, ‘Document has not changed since given time’)
305 (‘Use Proxy’, ‘You must use proxy specified in Location to access this resource.’)
307 (‘Temporary Redirect’, ‘Object moved temporarily – see URI list’)
400 (‘Bad Request’, ‘Bad request syntax or unsupported method’)
401 (‘Unauthorized’, ‘No permission – see authorization schemes’)
402 (‘Payment Required’, ‘No payment – see charging schemes’)
403 (‘Forbidden’, ‘Request forbidden – authorization will not help’)
404 (‘Not Found’, ‘Nothing matches the given URI’)
405 (‘Method Not Allowed’, ‘Specified method is invalid for this server.’)
406 (‘Not Acceptable’, ‘URI not available in preferred format.’)
407 (‘Proxy Authentication Required’, ‘You must authenticate with this proxy before proceeding.’)
408 (‘Request Timeout’, ‘Request timed out; try again later.’)
409 (‘Conflict’, ‘Request conflict.’)
410 (‘Gone’, ‘URI no longer exists and has been permanently removed.’)
411 (‘Length Required’, ‘Client must specify Content-Length.’)
412 (‘Precondition Failed’, ‘Precondition in headers is false.’)
413 (‘Request Entity Too Large’, ‘Entity is too large.’)
414 (‘Request-URI Too Long’, ‘URI is too long.’)
415 (‘Unsupported Media Type’, ‘Entity body in unsupported format.’)
416 (‘Requested Range Not Satisfiable’, ‘Cannot satisfy request range.’)
417 (‘Expectation Failed’, ‘Expect condition could not be satisfied.’)
500 (‘Internal Server Error’, ‘Server got itself in trouble’)
501 (‘Not Implemented’, ‘Server does not support this operation’)
502 (‘Bad Gateway’, ‘Invalid responses from another server/proxy.’)
503 (‘Service Unavailable’, ‘The server cannot process the request due to a high load’)
504 (‘Gateway Timeout’, ‘The gateway server did not receive a timely response’)
505 (‘HTTP Version Not Supported’, ‘Cannot fulfill request.’)
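The reason phrases in the table above can also be looked up programmatically from the standard library's http.client.responses dictionary:

```python
import http.client

# http.client.responses maps status code -> standard reason phrase:
print(http.client.responses[200])  # OK
print(http.client.responses[404])  # Not Found
print(http.client.responses[503])  # Service Unavailable
```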

8. Handling exceptions

from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
req = Request(someurl)
try:
    response = urlopen(req)
except HTTPError as e:
    print('The server couldn\'t fulfill the request.')
    print('Error code: ', e.code)
except URLError as e:
    print('We failed to reach a server.')
    print('Reason: ', e.reason)
else:
    # everything is fine
    print(response.geturl())

9. Handling exceptions 2
from urllib.request import Request, urlopen
from urllib.error import URLError
req = Request(someurl)
try:
    response = urlopen(req)
except URLError as e:
    if hasattr(e, 'reason'):
        print('We failed to reach a server.')
        print('Reason: ', e.reason)
    elif hasattr(e, 'code'):
        print('The server couldn\'t fulfill the request.')
        print('Error code: ', e.code)
else:
    # everything is fine
    print(response.geturl())

III. Summary

1. urllib.request.Request() and urlopen()

import urllib.request
url = "http://www.csdn.net/"
req = urllib.request.Request(url)
req.add_header('Referer', 'http://www.csdn.net')
req.add_header('User-Agent', 'chrome')

response = urllib.request.urlopen(req)
print(response.read(2000).decode('utf-8'))

2. Adding form data

import urllib.parse
import urllib.request
 
url = "http://httpbin.org/post"
data = bytes(urllib.parse.urlencode({'hello': 'world'}), encoding='utf-8')
response = urllib.request.urlopen(url, data=data)
print(response.read().decode('utf-8'))


3. Getting response information

import urllib.request
url = "http://httpbin.org/"
response = urllib.request.urlopen(url)

print(response.info())     # meta-information (headers)
print(response.getcode())  # HTTP status code
print(response.geturl())   # final URL, after any redirects


4. Saving data

To save HTML:

import urllib.request
url = "http://httpbin.org/"
response = urllib.request.urlopen(url)
data = response.read()

with open("file.html", "wb") as file:
    file.write(data)

To save an image or other binary data:

import urllib.request
url = "http://csdn.net/favicon.ico"
response = urllib.request.urlopen(url)
data = response.read()

with open("file.png", "wb") as file:
    file.write(data)


Another method, urlretrieve (note that the target directory must already exist):

import urllib.request
url = "http://csdn.net/"

urllib.request.urlretrieve(url,filename=r"d:\test\index.html")

The return value of urlretrieve can also be assigned to a variable; it is a tuple (filename, headers) that can then be printed.

5. Handling exceptions and timeouts

>>> import urllib.request
>>> for i in range(10):
...     try:
...         file = urllib.request.urlopen("http://zhihu.com", timeout=0.2)
...         data = file.read()
...         print(len(data))
...     except Exception as e:
...         print("Exception: " + str(e))
		


IV. Enjoy!

Origin blog.csdn.net/qq_21516633/article/details/104621945