[Third-Party Package Summary]

1. Python has a third-party package called chardet that can automatically detect the encoding of a web page.

import chardet

import urllib2

# Swap in different data as needed

TestData = urllib2.urlopen('http://www.baidu.com/').read()

print chardet.detect(TestData)

The correct encoding for this page is utf-8.

When I ran the test, however, the detected encoding came back as None:

{'encoding': None, 'confidence': 0.0}

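The cause turned out to be the page's content encoding: the response body is gzip-compressed, so you should check every time whether the response's 'Content-Encoding' header is 'gzip'.

urllib2 does not decompress gzip responses automatically, so for such pages read the raw bytes and decompress them before using the content: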

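import gzip
import socket
import StringIO
import urllib2

# Body of a method on a fetcher class: self.url, self.timeout,
# self.enc_dec and log are defined by the surrounding code.
try:
    response = urllib2.urlopen(self.url, timeout=self.timeout)
    if response.info().get('Content-Encoding', "") == 'gzip':
        # Wrap the compressed bytes in a file-like object and decompress
        buf = StringIO.StringIO(response.read())
        f = gzip.GzipFile(fileobj=buf)
        content = f.read()
    else:
        content = response.read()
    # Both branches go through the same encoding conversion
    content = self.enc_dec(content)
    return content
except socket.timeout:
    log.warn("Timeout in fetching %s" % self.url)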
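For Python 3, the same check can be sketched with the standard library alone. This is only an illustration under stated assumptions: the helper name `gunzip_if_needed` is hypothetical, the sample page is built locally instead of being fetched, and the extra magic-byte check is a guard added here, not something the snippet above does.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # every gzip stream starts with these two bytes


def gunzip_if_needed(body, content_encoding=""):
    """Decompress body when the header says gzip and the bytes look like gzip.

    body: raw response bytes; content_encoding: value of the
    'Content-Encoding' response header ("" when the header is absent).
    """
    header_says_gzip = content_encoding.strip().lower() == "gzip"
    if header_says_gzip and body.startswith(GZIP_MAGIC):
        return gzip.decompress(body)
    # Not gzip (or mislabeled): hand the bytes back untouched
    return body


# Stand-in for a gzip-compressed response body (no network access needed)
page = "<html><body>测试页面</body></html>".encode("utf-8")
compressed = gzip.compress(page)

restored = gunzip_if_needed(compressed, "gzip")
```

Passing the decompressed bytes, not the compressed ones, to chardet.detect is what makes the detection meaningful.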

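My own workaround was to comment out this header in the request:

'Accept-Encoding':'gzip, deflate, sdch',

With that line commented out, the response comes back as normal (uncompressed) bytes.

Running the detection again then gives: {'encoding': 'utf-8', 'confidence': 0.99}

Finally, decode the bytes to str to get complete, readable HTML.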
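The last step, turning bytes into str with the detected encoding, might look like the sketch below. The helper name `decode_with_detection`, the utf-8 fallback, and the 0.5 confidence threshold are all assumptions made for illustration; the dict has the same shape as the chardet results quoted in this post.

```python
def decode_with_detection(raw, detection, fallback="utf-8", min_confidence=0.5):
    """Decode raw bytes using a chardet-style result dict.

    Falls back to `fallback` when no encoding was detected or the
    confidence is below `min_confidence` (both values are assumptions).
    """
    encoding = detection.get("encoding")
    confidence = detection.get("confidence") or 0.0
    if not encoding or confidence < min_confidence:
        encoding = fallback
    # errors="replace" substitutes undecodable bytes instead of raising
    return raw.decode(encoding, errors="replace")


# Locally built sample bytes standing in for the downloaded page
raw = "<title>测试页面</title>".encode("utf-8")
detection = {"encoding": "utf-8", "confidence": 0.99}

html = decode_with_detection(raw, detection)
```

With a failed detection such as {'encoding': None, 'confidence': 0.0}, the helper simply tries the fallback encoding.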

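2. import requests

requests is an HTTP client library for Python, similar to urllib and urllib2. So why use requests instead of urllib2? The official documentation explains it like this:

Python's standard library urllib2 provides most of the HTTP functionality you need, but the API is thoroughly broken: even a simple task takes a pile of code.

I downloaded the package (the "download the tarball" link on the page) and ran $ python setup.py install to install it.

Of course, if you have easy_install or pip, you can install it directly with easy_install requests or pip install requests.

For Linux users, that page lists other installation methods as well.

Test: type import requests in IDLE. If no error is reported, the installation succeeded.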
Installing and using Python requests

Python Requests Quick Start

>>> import requests
>>> r = requests.get('http://www.zhidaow.com')  # send the request
>>> r.status_code  # status code
200
>>> r.headers['content-type']  # response header
'text/html; charset=utf8'
>>> r.encoding  # encoding
'utf-8'
>>> r.text  # body (PS: because of encoding issues, r.content is recommended here)
u'<!DOCTYPE html>\n<html xmlns="http://www.w3.org/1999/xhtml"...'
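The advice to prefer r.content amounts to taking the raw bytes and decoding them yourself. A rough Python 3 sketch of that step, using only the standard library and no network access, is below. The function names are hypothetical, the utf-8 fallback is an assumption, and this is not a description of how requests works internally.

```python
from email.message import Message


def charset_from_content_type(content_type, default=None):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def decode_content(content, content_type, fallback="utf-8"):
    """Decode raw body bytes with the header's charset, else the fallback."""
    charset = charset_from_content_type(content_type) or fallback
    try:
        return content.decode(charset, errors="replace")
    except LookupError:
        # The header named a codec Python does not know
        return content.decode(fallback, errors="replace")


# Sample bytes standing in for r.content, with the header from the session above
body = "<!DOCTYPE html>\n<html>你好</html>".encode("utf-8")
text = decode_content(body, "text/html; charset=utf8")
```

When the header carries no charset at all, this is the point where a detector such as chardet could supply the encoding instead of the fixed fallback.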

3. selenium, for scraping JS-rendered pages

pip install selenium

Python Crawler in Practice: Scraping Pages Built with JS
