The urllib library is Python's standard module for working with URLs. In Python 3, the urllib and urllib2 libraries from Python 2.x were merged into a single urllib package.
Crawling web pages with urllib
import urllib.request
file=urllib.request.urlopen("http://www.baidu.com")
data=file.read()
dataline=file.readline()
#print(dataline)
#print(data)
fhandle=open("D:/1.html","wb")
fhandle.write(data)
fhandle.close()
filename=urllib.request.urlretrieve("http://www.baidu.com",filename="D:/2.html") # download straight to a local file; returns (path, headers)
urllib.request.urlcleanup() # clear the cache that urlretrieve may leave behind
print(file.info()) # info() is a method; call it to get the response headers
file.read() reads the entire response and returns it as a single bytes object
file.readlines() returns the entire content as a list of lines; this is the recommended way to read the full text
file.readline() reads a single line
The urlretrieve() function downloads the web page directly to a local file; the target directory must already exist
Emulating a browser
The urlopen() function of urllib does not support some advanced HTTP features, such as custom request headers. The following two approaches are generally used to set headers:
1. the addheaders attribute of an opener created with urllib.request.build_opener()
2. the add_header() method of a urllib.request.Request object
urllib.request.build_opener() :
Prepare a header tuple headers=('User-Agent','some value')
opener=urllib.request.build_opener() # create a custom opener object
opener.addheaders=[headers] # addheaders expects a list of (name, value) tuples, not a dict
data=opener.open(url).read()
urllib.request.Request.add_header() :
req=urllib.request.Request(url) # create a Request object
req.add_header('field name','field value') # set the header (arguments separated by a comma, not a colon)
data=urllib.request.urlopen(req).read()
Proxy server settings:
import urllib.request

def use_proxy(proxy_addr, url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Mobile Safari/537.36'}
    proxy = urllib.request.ProxyHandler({'http': proxy_addr})  # set the proxy information
    opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)  # every later urlopen() call goes through this opener
    req = urllib.request.Request(url, headers=headers)  # send User-Agent as a request header
    data = urllib.request.urlopen(req).read().decode('utf-8')
    return data

proxy_addr = 'a proxy address goes here'  # placeholder: fill in a real host:port
url = 'http://www.baidu.com'
data = use_proxy(proxy_addr, url)
print(data[-50:])
Note that the response bytes must be decoded (utf-8 here), that the User-Agent belongs in the request headers rather than in the POST data,
and note the settings of the proxy server:
the first parameter of build_opener() is the proxy information (a ProxyHandler), and the second is the urllib.request.HTTPHandler class
proxy = urllib.request.ProxyHandler({'http': proxy_addr}) # set the proxy information
opener = urllib.request.build_opener(proxy, urllib.request.HTTPHandler)
urllib.request.install_opener(opener)
DebugLog in action
Nothing complicated here: to print the debug log while crawling,
set debuglevel=1 on the HTTP and HTTPS handlers, build an opener from them, and install it
# print the debug log while crawling
import urllib.request
httphd=urllib.request.HTTPHandler(debuglevel=1)
httpshd=urllib.request.HTTPSHandler(debuglevel=1)
opener=urllib.request.build_opener(httphd,httpshd)
urllib.request.install_opener(opener)
data=urllib.request.urlopen('http://www.baidu.com')
print(data.read()[-50:])
URLError in action
The main reasons for generating URLError are:
- Can't connect to server
- Remote URL does not exist
- No network
- An HTTPError (a subclass of URLError) was raised
Briefly, the common status codes behind HTTPError and their meanings:
- 200 (Success) The server has successfully processed the request. Typically, this means that the server served the requested web page.
- 301 (Moved Permanently) The requested web page has been permanently moved to a new location. When the server returns this response (to a GET or HEAD request), it automatically redirects the requester to the new location.
- 302 (Found, moved temporarily) The server is currently responding to the request with a page from a different location, but the requester should continue to use the original location for future requests.
- 304 (Not Modified) The requested page has not been modified since the last request. When the server returns this response, no web page content is returned.
- 403 (Forbidden) The server rejected the request.
- 404 (Not Found) The server cannot find the requested web page.
- 500 (Internal Server Error) The server encountered an error and could not complete the request.
- 501 (Not Implemented) The server is not capable of fulfilling the request. This code may be returned, for example, when the server does not recognize the request method.
Code:
import urllib.request
import urllib.error
try:
    urllib.request.urlopen('https://www.google.com', timeout=1) # deliberately request a site that may be unreachable, with a short timeout
except urllib.error.URLError as e:
    if hasattr(e, 'code'):
        print(e.code)
    if hasattr(e, 'reason'):
        print(e.reason)
'''
if 'test' in obj:
this form only works when obj supports membership testing (an iterable/container)
if 'reason' in e: # Wrong! URLError is not a container
print(e.reason)
'''
It should be noted that HTTPError is a subclass of URLError, and a plain URLError has no code attribute, which is why the code above checks with hasattr() first. It can also be solved by writing two except clauses, catching HTTPError before URLError. And remember that the in operator only works on objects that support membership testing, so it cannot be used casually on an exception object.