If you are new to crawlers, please see the first lesson of this series: the basic principles of crawlers.
What is urllib
urllib is Python's built-in HTTP request library. It contains four modules:
- urllib.request: the request module
- urllib.error: the exception-handling module
- urllib.parse: the URL-parsing module
- urllib.robotparser: the robots.txt-parsing module (determines which pages a site allows crawlers to fetch)
This article mainly covers the first three modules.
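Although urllib.robotparser is not covered below, here is a minimal sketch of how it works. To keep the example self-contained it parses an in-memory robots.txt (the example.com URLs and the rules are made-up placeholders) instead of fetching a real file:

```python
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# parse an in-memory robots.txt instead of fetching one over the network
rp.parse([
    'User-agent: *',
    'Disallow: /private/',
])
print(rp.can_fetch('*', 'http://example.com/index.html'))  # True
print(rp.can_fetch('*', 'http://example.com/private/x'))   # False
```

In real use you would call `rp.set_url('https://.../robots.txt')` followed by `rp.read()` to fetch the live file.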
Differences from the urllib2 module in Python 2
Python2
import urllib2
response = urllib2.urlopen('http://www.baidu.com')
Python3
import urllib.request
response = urllib.request.urlopen('http://www.baidu.com')
Usage explanation
Making requests: urlopen
The first parameter, url, is the address to request; the second, data, is needed for POST requests; the third, timeout, sets a time limit for the request. The remaining parameters are rarely needed and are not covered here.
urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)
GET request: open a web page
import urllib.request
response = urllib.request.urlopen('https://www.baidu.com/')
print(response.read().decode('utf-8'))  # decode the bytes
POST request: open a web page
import urllib.request
import urllib.parse
data = bytes(urllib.parse.urlencode({'word': 'hello'}), encoding='utf8')  # encode to bytes
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read())
timeout
Sets a time limit; if the request does not complete within the specified time, an error is raised.
import urllib.request
response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)  # such a short timeout will usually expire, raising urllib.error.URLError
print(response.read())
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('http://httpbin.org/get', timeout=0.1)
except urllib.error.URLError as e:
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
Responses
Response type
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(type(response))
Status code and response headers
The response includes two important pieces of information: the status code and the response headers.
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(response.status)  # status code
print(response.getheaders())  # all response headers
print(response.getheader('Server'))  # a specific response header
Reading the response: read
# read() returns the body as bytes, which must be decoded (here as UTF-8)
import urllib.request
response = urllib.request.urlopen('https://www.python.org')
print(response.read().decode('utf-8'))
Request
With urlopen we can request a web page and get the response, but urlopen cannot attach extra information such as request headers. For that, the more flexible Request object is needed.
Read webpage
import urllib.request
request = urllib.request.Request('https://python.org')  # declare a Request object
response = urllib.request.urlopen(request)
print(response.read().decode('utf-8'))
Add request headers and send POST data at the same time
from urllib import request, parse
url = 'http://httpbin.org/post'
headers = {
    'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)',
    'Host': 'httpbin.org'
}
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')  # encode to bytes
req = request.Request(url=url, data=data, headers=headers, method='POST')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
The request headers above were added via the headers parameter; you can also use the add_header method:
from urllib import request, parse
url = 'http://httpbin.org/post'
dict = {
    'name': 'Germey'
}
data = bytes(parse.urlencode(dict), encoding='utf8')
req = request.Request(url=url, data=data, method='POST')
req.add_header('User-Agent', 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)')
response = request.urlopen(req)
print(response.read().decode('utf-8'))
Handler: using a proxy
import urllib.request
proxy_handler = urllib.request.ProxyHandler({
    'http': 'http://127.0.0.1:9743',  # proxy server address
    'https': 'https://127.0.0.1:9743'
})
opener = urllib.request.build_opener(proxy_handler)
response = opener.open('http://httpbin.org/get')
print(response.read())
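As a side note, build_opener can combine several handlers, and urllib.request.install_opener makes the resulting opener the process-wide default, so plain urlopen calls will also go through the proxy. A short sketch, reusing the same placeholder proxy address as above:

```python
import urllib.request

# same placeholder proxy address as in the example above
proxy_handler = urllib.request.ProxyHandler({'http': 'http://127.0.0.1:9743'})
opener = urllib.request.build_opener(proxy_handler)
urllib.request.install_opener(opener)  # urlopen() now uses this opener by default
print(isinstance(opener, urllib.request.OpenerDirector))  # True
```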
Cookies
A cookie is a small piece of text saved on the client and used to record the user's identity.
For example, open the Taobao homepage while logged in: the username is displayed in the upper-left corner.
Inspect the page, find the Taobao cookies, and right-click to clear them.
After refreshing the page, the login information disappears.
Because cookies maintain the user's login state, a crawler that carries them can fetch pages that require login.
1. Print cookies
import http.cookiejar, urllib.request
cookie = http.cookiejar.CookieJar()  # declare a CookieJar object
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/')  # Baidu (here I am logged into a Baidu account)
for item in cookie:
    print(item.name + "=" + item.value)
2. Save cookies to a text file
Method 1
import http.cookiejar, urllib.request
filename = "cookie.txt"
cookie = http.cookiejar.MozillaCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/')
cookie.save(ignore_discard=True, ignore_expires=True)
Method 2
import http.cookiejar, urllib.request
filename = 'cookie.txt'
cookie = http.cookiejar.LWPCookieJar(filename)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/')
cookie.save(ignore_discard=True, ignore_expires=True)
3. Load cookies and read a web page
import http.cookiejar, urllib.request
cookie = http.cookiejar.LWPCookieJar()  # must match the format used when saving
cookie.load('cookie.txt', ignore_discard=True, ignore_expires=True)
handler = urllib.request.HTTPCookieProcessor(cookie)
opener = urllib.request.build_opener(handler)
response = opener.open('http://www.baidu.com/')
print(response.read().decode('utf-8'))
Exception handling
from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.URLError as e:
    print(e.reason)

HTTPError is a subclass of URLError, so catch the more specific HTTPError first:
from urllib import request, error

try:
    response = request.urlopen('http://cuiqingcai.com/index.htm')
except error.HTTPError as e:
    print(e.reason, e.code, e.headers, sep='\n')
except error.URLError as e:
    print(e.reason)
else:
    print('Request Successful')
import socket
import urllib.request
import urllib.error

try:
    response = urllib.request.urlopen('https://www.baidu.com', timeout=0.01)
except urllib.error.URLError as e:
    print(type(e.reason))
    if isinstance(e.reason, socket.timeout):
        print('TIME OUT')
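The patterns above can be combined into a small reusable helper. This is just a sketch (the fetch function name and its behavior of returning None on failure are my own choices, not part of urllib):

```python
from urllib import request, error

def fetch(url, timeout=10):
    """Return the decoded body of url, or None if the request fails."""
    try:
        with request.urlopen(url, timeout=timeout) as response:
            return response.read().decode('utf-8')
    except error.HTTPError as e:   # server returned an error status code
        print('HTTP error:', e.code, e.reason)
    except error.URLError as e:    # network problem, timeout, unreachable host...
        print('URL error:', e.reason)
    return None

# an unresolvable host triggers the URLError branch and returns None
print(fetch('http://nonexistent.invalid/'))
```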
URL parsing
urlparse
urlparse takes a URL and splits it into its components: scheme, netloc, path, params, query, and fragment.
urllib.parse.urlparse(urlstring, scheme='', allow_fragments=True)
from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment')
print(type(result), result)
Specify a default scheme when the incoming URL has none:
from urllib.parse import urlparse
result = urlparse('www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
If the incoming URL already has a scheme, the scheme parameter is ignored; here the result keeps http even though scheme='https' is passed:
from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', scheme='https')
print(result)
Ignoring the fragment: allow_fragments
from urllib.parse import urlparse
result = urlparse('http://www.baidu.com/index.html;user?id=5#comment', allow_fragments=False)  # the #fragment is folded into the preceding component instead of being split out
print(result)
urlunparse: assembling a URL from its components
from urllib.parse import urlunparse
data = ['http', 'www.baidu.com', 'index.html', 'user', 'a=6', 'comment']
print(urlunparse(data))
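urlunparse is the inverse of urlparse, so a parsed URL reassembles into the original string:

```python
from urllib.parse import urlparse, urlunparse

url = 'http://www.baidu.com/index.html;user?id=5#comment'
parts = urlparse(url)
# the six components (scheme, netloc, path, params, query, fragment) round-trip exactly
print(urlunparse(parts) == url)  # True
```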
urljoin
from urllib.parse import urljoin
print(urljoin('http://www.baidu.com', 'FAQ.html'))
print(urljoin('http://www.baidu.com', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html'))
print(urljoin('http://www.baidu.com/about.html', 'https://cuiqingcai.com/FAQ.html?question=2'))
print(urljoin('http://www.baidu.com?wd=abc', 'https://cuiqingcai.com/index.php'))
print(urljoin('http://www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com', '?category=2#comment'))
print(urljoin('www.baidu.com#comment', '?category=2'))
urlencode
Converts a dictionary into GET request parameters (a query string):
from urllib.parse import urlencode
params = {
    'name': 'germey',
    'age': 22
}
base_url = 'http://www.baidu.com?'
url = base_url + urlencode(params)
print(url)
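The standard library also provides the inverse, parse_qs, which turns a query string back into a dictionary. Note that the values come back as lists, because a key may appear multiple times in a query string:

```python
from urllib.parse import urlencode, parse_qs

params = {'name': 'germey', 'age': 22}
query = urlencode(params)
print(query)            # name=germey&age=22
print(parse_qs(query))  # {'name': ['germey'], 'age': ['22']}
```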
Author: Electrical - Yudeng Wu.