urllib is Python's built-in HTTP request library. It includes the following modules:
urllib.request — the request module
urllib.error — the exception-handling module
urllib.parse — the URL-parsing module
urllib.robotparser — the robots.txt-parsing module
The signature of urllib.request.urlopen is:
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
Use of the url parameter
Write a simple example first:
import urllib.request

response = urllib.request.urlopen('http://www.baidu.com')
print(response.read().decode('utf-8'))
There are three parameters commonly used in urlopen. Its parameters are as follows:
urllib.requeset.urlopen(url, data, timeout)
response.read() returns the content of the web page. Without calling read(), printing the result gives the response object itself rather than the page content.
Use of the data parameter
The example above sends a GET request to Baidu. Here we demonstrate a POST request with urllib against http://httpbin.org/post (this website can be used as a practice site for urllib, since it can simulate various request operations).
import urllib.parse
import urllib.request

data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
print(data)
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
urllib.parse is used here: passing the POST data through bytes(urllib.parse.urlencode(...)) converts it into the form expected by the data parameter of urllib.request.urlopen. This completes a POST request.
So if we add the data parameter, the request is a POST; without the data parameter, it is a GET.
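The GET-versus-POST switch can be checked without sending anything over the network: a Request object reports which method it would use. A minimal sketch (nothing is actually requested here):

```python
import urllib.parse
import urllib.request

# Without a data payload, a Request defaults to GET.
get_req = urllib.request.Request('http://httpbin.org/get')

# With a data payload, the method switches to POST automatically.
payload = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')
post_req = urllib.request.Request('http://httpbin.org/post', data=payload)

print(get_req.get_method())   # GET
print(post_req.get_method())  # POST
```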
The use of the timeout parameter
Some network conditions or abnormal server-side conditions can make a request slow or hang. In such cases we need to set a timeout for the request instead of letting the program wait indefinitely for a result. Examples are as follows:
import urllib.request

response = urllib.request.urlopen('http://httpbin.org/get', timeout=1)
print(response.read())
After running this, we can see the result is returned normally. If we set the timeout to 0.1, however, the running program raises a timeout error, so we need to catch the exception and change the code to:
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
response
Response type, status code, response headers
import urllib.request

response = urllib.request.urlopen('https://www.python.org')
print(type(response))
You can see that the result is: <class 'http.client.HTTPResponse'>
We can get the status code via response.status, all response headers via response.getheaders(), and a single header via response.getheader("server"); response.read() returns the content of the response body.
Of course, urlopen as used above only suits simple requests, because it cannot attach header information. When we write a crawler later we will find that, in many cases, we need to add headers to access the target site, and that is when urllib.request.Request is used.
Request
Setting Headers
Many websites, to prevent crawler programs from overwhelming them, require some header information to be carried before they can be accessed; the most common is the User-Agent parameter.
Write a simple example:
import urllib.request

request = urllib.request.Request('https://python.org')
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
Add header information to the Request object to customize the headers sent when you request the website:
from urllib import request, parse

url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
form = {  # renamed from `dict` to avoid shadowing the built-in
    'name': 'zhaofan'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
The second way to add request headers
from urllib import request, parse

url = 'http://httpbin.org/post'
form = {  # renamed from `dict` to avoid shadowing the built-in
    'name': 'Germey'
}
data = bytes(parse.urlencode(form), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
This way of adding headers has the advantage that you can define a request-header dictionary yourself and then add the entries in a loop.
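As a sketch of that loop-based approach, reusing the headers from the earlier example (the request is only built here, nothing is sent over the network):

```python
from urllib import request, parse

# A header dictionary we define ourselves, added entry by entry.
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org',
}
data = bytes(parse.urlencode({'name': 'zhaofan'}), encoding='utf8')
req = request.Request(url='http://httpbin.org/post', data=data, method='POST')
for key, value in headers.items():
    req.add_header(key, value)

# Request stores header names capitalized, e.g. 'User-agent'.
print(req.get_header('User-agent'))
```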
Advanced usage of various handlers
Proxy, ProxyHandler
You can set a proxy through urllib.request.ProxyHandler(). Websites often track how many visits a given IP makes within a period of time and block it if there are too many, so at that point you need to set a proxy to keep crawling data.
import urllib.request

proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())
Cookies, HTTPCookieProcessor
Our common login information is stored in cookies, and sometimes crawling a website requires carrying cookie information. http.cookiejar is used here to obtain and store cookies.
import http.cookiejar
import urllib.request

cookie = http.cookiejar.CookieJar()
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
for item in cookie:
    print(item.name + "=" + item.value)
At the same time, cookies can be written to a file and saved. There are two classes for this: http.cookiejar.MozillaCookieJar and http.cookiejar.LWPCookieJar. Either one works.
Specific code examples are as follows:
http.cookiejar.MozillaCookieJar() method
import http.cookiejar
import urllib.request

filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
http.cookiejar.LWPCookieJar() method
import http.cookiejar
import urllib.request

filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
cookie.save(ignore_discard=True, ignore_expires=True)
Similarly, if you want to read the cookies back from the file, use the load method. Note that whichever class was used to write the file must also be used to read it.
import http.cookiejar
import urllib.request

cookie = http.cookiejar.LWPCookieJar()
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com')
print(response.read().decode('utf-8'))
exception handling
Many times when we access pages from a program, some pages return errors such as 404 or 500, and at that point we need to catch the exceptions. Let's write a simple example first.
from urllib import request, error

try:
    response = request.urlopen("http://pythonsite.com/1111.html")
except error.URLError as e:
    print(e.reason)
The above code accesses a page that does not exist. By catching the exception, we can print the exception error
What we need to know here is that urllib defines two exception classes: URLError and HTTPError, where HTTPError is a subclass of URLError.
URLError has only one attribute, reason, so when catching it only the error message can be printed, as in the example above.
HTTPError has three attributes: code, reason, and headers, so when catching it you can retrieve all three. Examples are as follows:
from urllib import request, error

try:
    response = request.urlopen("http://pythonsite.com/1111.html")
except error.HTTPError as e:
    print(e.reason)
    print(e.code)
    print(e.headers)
except error.URLError as e:
    print(e.reason)
else:
    print("request successful")
At the same time, e.reason can be inspected more closely to determine the exact cause. Examples are as follows:
import socket
from urllib import error, request

try:
    response = request.urlopen("http://www.pythonsite.com/", timeout=0.001)
except error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print("time out")
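The subclass relationship above is why the HTTPError clause must come before the URLError clause: an `except URLError` placed first would also swallow HTTP errors. The relationship can be verified directly, without any network access:

```python
from urllib import error

# HTTPError is a subclass of URLError, so catch the more specific one first.
print(issubclass(error.HTTPError, error.URLError))  # True
```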
URL parsing
urlparse
The URL parsing functions focus on splitting a URL string into its components, or on combining URL components into a URL string.
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
Function one:
from urllib.parse import urlparse

result = urlparse("http://www.baidu.com/index.html;user?id=5#comment")
print(result)
The result is:
ParseResult(scheme='http', netloc='www.baidu.com', path='/index.html', params='user', query='id=5', fragment='comment')
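Each component of the ParseResult can also be read as a named attribute; a quick offline check:

```python
from urllib.parse import urlparse

result = urlparse("http://www.baidu.com/index.html;user?id=5#comment")
print(result.scheme)    # http
print(result.netloc)    # www.baidu.com
print(result.path)      # /index.html
print(result.params)    # user
print(result.query)     # id=5
print(result.fragment)  # comment
```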
Here urlparse splits the URL you pass in into its components, and we can also specify a default protocol type:
result = urlparse("www.baidu.com/index.html;user?id=5#comment", scheme="https")
When split this way, the scheme component is the one you specified. Of course, if your URL already contains a protocol, the scheme you specify will not take effect.
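That behavior can be confirmed without any network access; the `scheme` argument acts only as a default:

```python
from urllib.parse import urlparse

# No scheme in the URL: the scheme argument fills it in.
no_scheme = urlparse("www.baidu.com/index.html;user?id=5#comment", scheme="https")
# Scheme already present: the argument is ignored.
with_scheme = urlparse("http://www.baidu.com/index.html;user?id=5#comment", scheme="https")

print(no_scheme.scheme)    # https
print(with_scheme.scheme)  # http
```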
urlunparse
Its function is the opposite of urlparse's: it assembles a URL from components. Examples are as follows:
from urllib.parse import urlunparse

data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=123', 'commit']
print(urlunparse(data))
The result is as follows:
http://www.baidu.com/index.html;user?a=123#commit
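The six fields correspond to scheme, netloc, path, params, query, and fragment — the same tuple that urlparse produces. A quick check of the assembled URL:

```python
from urllib.parse import urlunparse

# scheme, netloc, path, params, query, fragment
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=123', 'commit']
url = urlunparse(data)
print(url)  # http://www.baidu.com/index.html;user?a=123#commit
```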
urljoin
This function joins a base URL with a second (possibly relative) URL. Examples are as follows:
from urllib.parse import urljoin

print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://pythonsite.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://pythonsite.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://pythonsite.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://pythonsite.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))
The result is:
From the results of the joins, we can see that fields present in the second URL take priority over those of the base URL.
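Two representative cases, checkable offline: a relative second argument replaces the last path segment of the base, while a complete URL replaces the base outright:

```python
from urllib.parse import urljoin

# Relative URL: resolved against the base, replacing its last path segment.
print(urljoin('http://www.baidu.com/about.html', 'FAQ.html'))
# → http://www.baidu.com/FAQ.html

# Absolute URL: the base is ignored entirely.
print(urljoin('http://www.baidu.com/about.html', 'https://pythonsite.com/FAQ.html'))
# → https://pythonsite.com/FAQ.html
```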
urlencode
This method converts a dictionary into URL query parameters. For example:
from urllib.parse import urlencode

params = {
    "name": "zhaofan",
    "age": 23,
}
base_url = "http://www.baidu.com?"
url = base_url + urlencode(params)
print(url)
The result is:
http://www.baidu.com?name=zhaofan&age=23
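urlencode also query-escapes values automatically; a small sketch:

```python
from urllib.parse import urlencode

# Dict values are converted to strings and query-escaped.
print(urlencode({"name": "zhaofan", "age": 23}))  # name=zhaofan&age=23
print(urlencode({"q": "hello world"}))            # q=hello+world  (space becomes '+')
```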