Python 3 Web Scraping (1): A First Look at urllib

What is the urllib library

urllib is a package in the Python standard library for working with URLs. It comes up constantly when scraping web pages.

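In Python 2.x this functionality was split between urllib and urllib2. Python 3.x merged the two into a single urllib package, and many names moved in the process. The common relocations are listed here so that anyone coming from Python 2.x can get up to speed quickly in Python 3.x.

The most common changes are: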

Python 2.x: import urllib2  ->  Python 3.x: import urllib.request, urllib.error
Python 2.x: import urllib  ->  Python 3.x: import urllib.request, urllib.error, urllib.parse
Python 2.x: import urlparse  ->  Python 3.x: import urllib.parse
Python 2.x: urllib2.urlopen  ->  Python 3.x: urllib.request.urlopen
Python 2.x: urllib.urlencode  ->  Python 3.x: urllib.parse.urlencode
Python 2.x: urllib.quote  ->  Python 3.x: urllib.parse.quote
Python 2.x: cookielib.CookieJar  ->  Python 3.x: http.cookiejar.CookieJar
Python 2.x: urllib2.Request  ->  Python 3.x: urllib.request.Request
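
To make the relocations concrete, here is a minimal sketch that touches several of the Python 3.x names without sending any request. The example.com URLs and the "demo-agent" User-Agent string are placeholders chosen for illustration, not values from this article.

```python
from http.cookiejar import CookieJar
from urllib.parse import quote, urlencode, urlparse
from urllib.request import HTTPCookieProcessor, Request, build_opener

# Python 2 urlparse module -> urllib.parse
parts = urlparse("http://example.com/search?q=python#top")
print(parts.netloc, parts.path, parts.query)

# Python 2 urllib.urlencode -> urllib.parse.urlencode
query = urlencode({"q": "python web scraping", "page": 1})
print(query)

# Python 2 urllib.quote -> urllib.parse.quote (spaces become %20)
quoted = quote("python web scraping")
print(quoted)

# Python 2 urllib2.Request -> urllib.request.Request
# (building the object does not send anything over the network)
req = Request("http://example.com/search?" + query,
              headers={"User-Agent": "demo-agent"})
print(req.full_url)

# Python 2 cookielib.CookieJar -> http.cookiejar.CookieJar
jar = CookieJar()
opener = build_opener(HTTPCookieProcessor(jar))
print(len(jar))  # the jar stays empty until a response sets a cookie
```

Nothing here contacts a server: a request would only go out once opener.open(req) or urllib.request.urlopen(req) is called.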

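Modules

1. urllib.request opens and reads URLs.

2. urllib.error contains the exceptions raised by urllib.request; they can be caught and handled with try/except.

3. urllib.parse provides functions for parsing URLs.

4. urllib.robotparser parses robots.txt files. It provides a single class, RobotFileParser, whose can_fetch() method tests whether a crawler is allowed to download a given page.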
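
The four modules can be sketched together as follows. This is only an illustration under stated assumptions: fetch() is a hypothetical helper name, the example.com URLs and the "demo-bot" user agent are placeholders, and the robots.txt rules are supplied inline so that nothing is downloaded.

```python
import urllib.error
import urllib.parse
import urllib.request
import urllib.robotparser


def fetch(url):
    """Return the response body as bytes, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read()
    except urllib.error.HTTPError as exc:
        # HTTPError is a subclass of URLError, so it has to be caught first
        print("HTTP error:", exc.code, exc.reason)
    except urllib.error.URLError as exc:
        print("URL error:", exc.reason)
    return None


# urllib.parse: split a URL into its components and decode the query string
parts = urllib.parse.urlsplit("http://example.com/a/b.html?x=1&y=2")
params = urllib.parse.parse_qs(parts.query)
print(parts.path, params)

# urllib.robotparser: feed the rules in directly instead of downloading them
rules = ["User-agent: *", "Disallow: /private/"]
parser = urllib.robotparser.RobotFileParser()
parser.parse(rules)
blocked = parser.can_fetch("demo-bot", "http://example.com/private/page.html")
allowed = parser.can_fetch("demo-bot", "http://example.com/index.html")
print(blocked, allowed)
```

Against a real site one would typically call parser.set_url(...) and parser.read() to download the actual robots.txt, then check can_fetch() before calling fetch().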

A simple example

import urllib.parse
import urllib.request

# Fetch the page and print the first 200 bytes of the raw (undecoded) response
response = urllib.request.urlopen('http://baidu.com')
baidu = response.read()
print(baidu[:200])

# Build a URL with a query string; quote_plus() escapes spaces as '+'
url = 'http://baidu.com?q='
url_with_query = url + urllib.parse.quote_plus('python web scraping')

web_search = urllib.request.urlopen(url_with_query).read()
web_search = web_search.decode("utf-8")  # decode the bytes into a str
print(web_search)

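Of course, this assumes we already know the page is encoded as UTF-8. How can we find out which encoding a page uses? A simple manual way is the browser's Inspect Element tool: look for the charset declaration near the start of the head tag, and it tells you the page's encoding.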
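
The same manual check can be approximated in code by searching the raw bytes for a charset declaration. The helper below is a hypothetical sketch, not part of urllib: the name charset_from_meta, the regular expression, and the 2048-byte search window are all assumptions, and a page that declares its charset in some other way will not be matched.

```python
import re

# Matches both <meta charset="..."> and the older
# <meta http-equiv="Content-Type" content="text/html; charset=..."> form
META_CHARSET = re.compile(rb'<meta[^>]+charset=["\']?([\w-]+)', re.IGNORECASE)


def charset_from_meta(raw_html, default=None):
    """Return the charset declared in a <meta> tag, or default if none is found."""
    # The declaration normally sits near the top of <head>
    match = META_CHARSET.search(raw_html[:2048])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


sample = b'<html><head><meta charset="GBK"><title>demo</title></head></html>'
print(charset_from_meta(sample))
```

When the server sends the charset in its Content-Type header, response.headers.get_content_charset() on the object returned by urlopen() is another place to look.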

Detecting a page's encoding automatically

Install the third-party library chardet:

pip install chardet

from urllib import request
import chardet

if __name__ == "__main__":
    response = request.urlopen("http://fanyi.baidu.com/")
    html = response.read()            # raw bytes, not yet decoded
    charset = chardet.detect(html)    # guess the encoding from the bytes
    print(charset)

{'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

The result is a dictionary that tells us the page's encoding. With that information we can pick the matching codec when decoding.
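
One way to use that dictionary is sketched below. The helper name decode_html and the UTF-8 fallback are assumptions made for illustration. The function takes a dictionary shaped like the one printed above, so the sketch itself does not need chardet installed.

```python
def decode_html(raw, detection, fallback="utf-8"):
    """Decode raw bytes using a chardet-style detection dict.

    detection is expected to look like {'encoding': ..., 'confidence': ...};
    'encoding' may be None when detection failed.
    """
    encoding = detection.get("encoding") or fallback
    try:
        return raw.decode(encoding)
    except (LookupError, UnicodeDecodeError):
        # Unknown codec name or a wrong guess: fall back and replace
        # undecodable bytes rather than raising
        return raw.decode(fallback, errors="replace")


raw = "中文网页".encode("gbk")
detected = {"encoding": "GB2312", "confidence": 0.99, "language": "Chinese"}
print(decode_html(raw, detected))
```

In the script above this would be called as decode_html(html, chardet.detect(html)). Since chardet only reports a guess with a confidence value, keeping a fallback path is a reasonable precaution.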