Python web crawlers: the urllib library

Article Updated: 2020-03-02
Note: The code examples come from the instructor's course materials.

First, getting to know the urllib library

In Python 2.x, urllib and urllib2 were two separate libraries. In Python 3.x they were merged into a single urllib package, organized into submodules such as urllib.request, urllib.parse, and urllib.error.
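
The renaming can be checked directly. The sketch below lists a few well-known Python 2 names next to their Python 3 locations and confirms that each Python 3 name resolves. The table is illustrative rather than exhaustive, and PY2_TO_PY3 and resolve are names chosen for this example.

```python
import importlib

# A few Python 2 names and their Python 3 locations (illustrative, not exhaustive).
PY2_TO_PY3 = {
    "urllib2.urlopen": "urllib.request.urlopen",
    "urllib2.Request": "urllib.request.Request",
    "urllib.urlencode": "urllib.parse.urlencode",
    "urllib.quote": "urllib.parse.quote",
    "urllib2.URLError": "urllib.error.URLError",
    "urllib2.HTTPError": "urllib.error.HTTPError",
}

def resolve(dotted_name):
    """Import the module part of 'pkg.module.attr' and return the attribute."""
    module_name, _, attr = dotted_name.rpartition(".")
    return getattr(importlib.import_module(module_name), attr)

for old, new in PY2_TO_PY3.items():
    print(f"{old:20} -> {new} ({type(resolve(new)).__name__})")
```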

Second, practicing with the urllib library

1, crawling a page and printing it

'''
A first look at the urllib library: how to crawl a web page with it
  1. Import the urllib.request module
  2. Call urllib.request.urlopen() to open and fetch a page
  3. Call response.read() to read the page content, then decode it as UTF-8
'''

import urllib.request         # import the urllib.request module
response = urllib.request.urlopen("http://httpbin.org")   # fetch the page at 'http://httpbin.org'
print(response.read().decode('utf-8'))        # print the fetched content decoded as UTF-8

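'''
The urlopen method has three commonly used parameters:
urllib.request.urlopen(url, data, timeout)
  url is the address to open;
  data is the payload sent with the request, typically used for POST requests;
  timeout is the access timeout for the site, in seconds.
'''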
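
The three parameters can be wrapped in a small helper. This is a minimal sketch, not the course's own code: fetch_text and default_charset are hypothetical names, and it assumes the server's Content-Type header, when present, names the right charset. The with statement closes the connection once the body has been read.

```python
import urllib.request

def fetch_text(url, data=None, timeout=10, default_charset="utf-8"):
    """Open `url` and return the body as text.

    The charset is taken from the Content-Type header when the server
    declares one; otherwise `default_charset` is used (an assumption).
    """
    with urllib.request.urlopen(url, data=data, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or default_charset
        return response.read().decode(charset)
```

A call such as fetch_text("http://httpbin.org", timeout=5) would return the same text that the listing above prints.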

Code execution results are as follows:

Figure I

2, submitting data with POST

'''
Fetching a page with a POST request using the urllib library
'''

import urllib.parse       # urllib.parse is the URL parsing module
import urllib.request
# urlencode takes a dict and converts its key-value pairs into a URL-encoded string
data = bytes(urllib.parse.urlencode({'word':'hello'}), encoding='utf-8')
print(data)
response = urllib.request.urlopen('http://httpbin.org/post', data=data)
print(response.read().decode('utf-8'))
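
It can help to see why passing data turns the call into a POST. The sketch below builds the request without sending it and inspects it; build_post_request is a hypothetical helper name used only for this illustration.

```python
import urllib.parse
import urllib.request

def build_post_request(url, fields):
    """Return an unsent Request whose body is the URL-encoded form of `fields`."""
    body = urllib.parse.urlencode(fields).encode("utf-8")
    return urllib.request.Request(url, data=body)

req = build_post_request("http://httpbin.org/post", {"word": "hello"})
print(req.get_method())   # POST, because a request body is attached
print(req.data)           # b'word=hello'
```

The resulting object can be passed to urllib.request.urlopen(req) in place of the URL string.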

Code execution results are as follows:

Figure II

3, saving the crawled result to a file

# Save the crawled page as a local file

import urllib.request
response = urllib.request.urlopen("http://httpbin.org")
data = response.read()
filehandle = open('D:/1.html',"wb")   # open the file in wb (binary write) mode with open()
filehandle.write(data)
filehandle.close()                    # close() must be called, with parentheses, to actually close the file

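'''
The code first opens the file in wb (binary write) mode with open()
and assigns the result to the variable filehandle.
It then uses write() to write the crawled data into the open file.
Once writing is done, close() closes the file so that it can no longer be read or written, and the program ends.
'''

'''
# urlretrieve() in urllib.request can also write the fetched content straight to a local file, as follows
import urllib.request
filename = urllib.request.urlretrieve("http://httpbin.org",
    filename = "D:/1.html")
'''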
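
A with statement closes both the response and the file even when an error interrupts the write. The following is a minimal sketch of that variant; save_page is a hypothetical name, and copying with shutil.copyfileobj is a choice made here so that large pages are streamed instead of held in memory.

```python
import shutil
import urllib.request

def save_page(url, path, timeout=10):
    """Stream the body of `url` into the file at `path` and return `path`."""
    with urllib.request.urlopen(url, timeout=timeout) as response, \
            open(path, "wb") as out_file:
        shutil.copyfileobj(response, out_file)   # copy in chunks
    return path
```

For the example above, the call would be save_page("http://httpbin.org", "D:/1.html").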

Code execution results:

Figure III
The saved file, opened:

Figure IV

4, printing information about the crawled response

# Get the response headers, status code, and URL

import urllib.request
file = urllib.request.urlopen("http://httpbin.org")
print(file.info())      # response headers
print(file.getcode())   # status code; 200 means the request succeeded
print(file.geturl())    # URL that was actually fetched
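
The Python documentation marks info(), getcode(), and geturl() as deprecated since Python 3.9 in favor of the headers, status, and url attributes. The sketch below collects the same information through those attributes; summarize_response is a hypothetical helper name, and the code assumes Python 3.9 or later.

```python
import urllib.request

def summarize_response(response):
    """Collect status, final URL, and content type from an open response."""
    return {
        "status": response.status,                            # e.g. 200 for HTTP success
        "url": response.url,                                  # URL after any redirects
        "content_type": response.headers.get_content_type(),  # e.g. 'text/html'
    }
```

With the listing above, summarize_response(file) would report the same status code and URL that getcode() and geturl() print.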

Code execution results are as follows:

Figure V

5, setting a timeout for the crawler

# Setting a timeout
# Choosing the timeout value correctly when crawling a page helps avoid timeout exceptions. Format: urllib.request.urlopen("url", timeout=seconds)


import urllib.request
for i in range(1,10):
    try:
        file = urllib.request.urlopen("http://www.zhihu.com",
              timeout=0.3)                  # timeout for opening the page: 0.3 seconds
        data = file.read()
        print(len(data))                    # print the length of the crawled content
    except Exception as e:                  # catch the exception
        print("Exception: "+str(e))
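
Catching Exception hides whether a failure was a timeout or something else. The sketch below is one way to tell them apart and retry only on timeouts. The names is_timeout and fetch_with_retries, and the default values for attempts and pause, are assumptions made for this illustration.

```python
import socket
import time
import urllib.error
import urllib.request

def is_timeout(exc):
    """True if `exc` is a socket timeout, raised directly or wrapped in URLError."""
    if isinstance(exc, socket.timeout):
        return True
    return (isinstance(exc, urllib.error.URLError)
            and isinstance(exc.reason, socket.timeout))

def fetch_with_retries(url, timeout=3.0, attempts=3, pause=1.0):
    """Return the response body, retrying only when the failure is a timeout."""
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                return response.read()
        except (urllib.error.URLError, socket.timeout) as exc:
            if not is_timeout(exc) or attempt == attempts:
                raise                 # not a timeout, or out of attempts
            time.sleep(pause)         # brief pause before the next attempt
```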

Code execution results are as follows:

Figure VI

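6, crawling with query parameters

'''
Entering www.codingke.com in a browser opens the Codingke (扣丁学堂) home page.
Searching for the keyword php on that page changes the URL to http://www.codingke.com/search/course?keywords=php.
Here keywords=php is exactly the query being sent, so the value of the keywords field is the search term the user entered.
In other words, a keyword search on Codingke is a GET request whose key field is keywords, and the query format is http://www.codingke.com/search/course?keywords=<keyword>.
The example below has a crawler search Codingke for php automatically and save the result to a local file.
'''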

import urllib.request
keywd='php'
url='http://www.codingke.com/search/course?keywords='+keywd
req=urllib.request.Request(url)
data=urllib.request.urlopen(req).read()
fhandle=open("D:/php.html",'wb')
fhandle.write(data)
fhandle.close()
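
String concatenation works for a single plain keyword. As a hedged alternative, urllib.parse.urlencode can build the query string and escape each value, which matters once there are several parameters or special characters. build_search_url is a hypothetical helper name, and the page parameter mentioned afterwards is invented for illustration and is not a known Codingke field.

```python
import urllib.parse

def build_search_url(base_url, **params):
    """Append `params` to `base_url` as a URL-encoded query string."""
    return base_url + "?" + urllib.parse.urlencode(params)

url = build_search_url("http://www.codingke.com/search/course", keywords="php")
print(url)   # http://www.codingke.com/search/course?keywords=php
```

Calling build_search_url(base, keywords="web dev", page=2) would encode the space as "+" and join the two fields with "&".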

Code execution results:

Figure VII
The saved file, opened:

Figure VIII

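7, querying with a Chinese keyword

'''
When the search keyword is in Chinese,
the keyword part must be percent-encoded with urllib.parse.quote() before the full URL is rebuilt.
(urllib.request.quote() refers to the same function, but urllib.parse is its documented home.)
'''


import urllib.parse
import urllib.request
url = 'http://www.codingke.com/search/course?keywords='
keywd = '开发'                                # a Chinese search keyword
key_code = urllib.parse.quote(keywd)         # percent-encode the keyword
url_all = url+key_code                       # concatenate into the full URL
req = urllib.request.Request(url_all)
data = urllib.request.urlopen(req).read()
fhandle = open('D:/dev.html','wb')
fhandle.write(data)
fhandle.close()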
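
The effect of the encoding step is easy to inspect offline. The short sketch below shows what quote() produces for the keyword, how unquote() reverses it, and how the safe parameter and quote_plus() change the result.

```python
import urllib.parse

encoded = urllib.parse.quote("开发")
print(encoded)                                 # %E5%BC%80%E5%8F%91
print(urllib.parse.unquote(encoded))           # 开发

# '/' is left alone by default; pass safe="" to encode it as well.
print(urllib.parse.quote("a b/c"))             # a%20b/c
print(urllib.parse.quote("a b/c", safe=""))    # a%20b%2Fc

# quote_plus() encodes spaces as '+', the form used in query strings.
print(urllib.parse.quote_plus("a b"))          # a+b
```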

Code execution results:

Figure IX
The saved file, opened:

Figure X

8, using a proxy

'''
# Setting up a proxy server
When the same IP address crawls pages frequently, the site's server is very likely to block that address.
The Xici site (http://www.xicidaili.com/) lists many free proxy server addresses.
The example below crawls a page through a proxy, using a proxy at address 222.95.240.191 on port 3000.
'''

import urllib.request
# Define a function that crawls through a proxy
def use_proxy(proxy_addr,url):
    # proxy server settings
    proxy = urllib.request.ProxyHandler({'http':proxy_addr})
    # create the opener object
    opener = urllib.request.build_opener(proxy,urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    data = urllib.request.urlopen(url).read().decode('utf-8')
    return data
proxy_addr = '222.95.240.191:3000'
data = use_proxy(proxy_addr,"http://www.1000phone.com")
print('Length of the page data:',len(data))


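'''
The example first defines use_proxy(proxy_addr, url), which crawls the page at url through a proxy server.
The first parameter, proxy_addr, is the proxy server's IP address and port; the second, url, is the address of the page to crawl.
urllib.request.ProxyHandler() holds the proxy server settings.
urllib.request.build_opener() then creates a custom opener object;
its first argument is the proxy handler and its second is a handler class.
urllib.request.install_opener() installs that opener as the global default,
so later calls to urlopen() also go through it.
That is why the code can call urllib.request.urlopen() directly to open and read the page,
assign the decoded content to data, and finally return data to the caller.
'''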

Code execution results are as follows:
Execution failed because the proxy IP had expired.

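9, exception handling 1

# Exceptions are unavoidable while a program runs, so handling them is a routine part of writing code.
'''
First import the exception module, urllib.error.
Python handles exceptions with the try-except statement:
the main code runs in try, and except catches the exception and handles it.
    A URLError is usually caused by a missing network connection, an unreachable server, or a server that does not exist.
With the computer confirmed to be online, the code below requests a nonexistent address (http://www.a.b.c) to show how the URLError class is used to handle a URLError.
'''

# Handling a nonexistent URL with the exception module   (Example 3-6)
import urllib.request
import urllib.error
try:
    urllib.request.urlopen("http://www.a.b.c")      # crawl a URL that does not exist
except urllib.error.URLError as e:                  # catch the exception explicitly
    print(e.reason)                                 # print the reason for the exception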

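'''
The example requests a URL that does not exist.
The error triggers the except block, which captures the exception as e through urllib.error.URLError as e
and prints its cause (e.reason). The cause is "getaddrinfo failed", meaning the address lookup failed.
'''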

Code execution results are as follows:

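10, exception handling 2

'''
When handling exceptions with URLError, there is also a kind of exception that carries a status code.
The example below appends "/1" to the Qianfeng site address (http://www.1000phone.com) to form a wrong URL and shows how the URLError class handles this kind of error.
'''


# Handling a wrong URL with the exception module   (Example 3-7)
import urllib.request
import urllib.error
try:
    urllib.request.urlopen("http://1000phone.com/1")  # crawl a URL that does not exist
except urllib.error.URLError as e:                    # catch the exception explicitly
    # print(e)                                        # print the exception
    # print(dir(e))                                   # inspect the attributes and methods of e
    print(e.code)                                     # print the status code
    print(e.reason)                                   # print the reason for the exception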

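'''
The example requests a wrong URL and prints the status code "404" with the reason "Not Found".
As noted earlier, a URLError usually has one of these causes:
    no network connection; the specified server cannot be reached; the server does not respond
The 404 here is none of those three; it comes from an HTTPError being raised.
Unlike a plain URLError, an HTTPError always carries a status code, and the example can print one only because the exception is an HTTPError.
'''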
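
The relationship between the two classes can be checked without any network access. In the sketch below, status_code_of is a hypothetical helper: it returns the code only for an HTTPError, because a plain URLError has no code attribute.

```python
import urllib.error

def status_code_of(exc):
    """Return the HTTP status code for an HTTPError, or None for a plain URLError."""
    if isinstance(exc, urllib.error.HTTPError):
        return exc.code
    return None

print(issubclass(urllib.error.HTTPError, urllib.error.URLError))    # True
print(status_code_of(urllib.error.URLError("getaddrinfo failed")))  # None
```

This is why an except urllib.error.URLError clause also catches a 404, and why printing e.code in that clause would fail for a plain URLError.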

Code execution results are as follows:

11, exception handling 3

# Handling exceptions with both HTTPError and URLError   (Example 3-11)
import urllib.error
import urllib.request
try:
    urllib.request.urlopen("http://www.1000phone.cc")
except urllib.error.HTTPError as e :        # handle the subclass first
    print(e.code)
    print(e.reason)
except urllib.error.URLError as e :         # then handle the parent class
    print(e.reason)

12, exception handling 4

# Handling HTTPError through URLError   (Example 3-12)
import urllib.request
import urllib.error
try:
    urllib.request.urlopen("http://www.1000phone.cc")
except urllib.error.URLError as e :
    if hasattr(e,'code'):       # use hasattr to check whether e has a code attribute
        print(e.code)           # print the status code
    print(e.reason)
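
The two patterns above can be folded into one reusable function. The following is a minimal sketch under stated assumptions: safe_fetch and its (body, error_message) return convention are invented for this illustration. It relies on the subclass-first ordering of the except clauses, and it does not cover timeouts raised while reading the body.

```python
import urllib.error
import urllib.request

def safe_fetch(url, timeout=10):
    """Return (body, error_message); exactly one of the two is None."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.read(), None
    except urllib.error.HTTPError as e:   # subclass first: server replied with an error status
        return None, f"HTTP {e.code}: {e.reason}"
    except urllib.error.URLError as e:    # parent class: no usable HTTP response
        return None, f"URL error: {e.reason}"
```

A caller can then write body, error = safe_fetch("http://www.1000phone.cc") and branch on whether error is None.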
