(2) Fetching a Web Page's Source Code: Python

Python version: extremely short

#!/usr/bin/python
#-*- coding: utf-8 -*-
import urllib2
response = urllib2.urlopen("http://www.baidu.com")
print response.read()

POST method:

#!/usr/bin/python
#-*- coding: utf-8 -*-

import urllib
import urllib2

values = {"username":"[email protected]","password":"XXXX"}
data = urllib.urlencode(values)
url = "https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn"
request = urllib2.Request(url,data)
response = urllib2.urlopen(request)
print response.read()
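
For readers on Python 3, where urllib2 and urllib.urlencode were folded into urllib.request and urllib.parse, the POST request above might look roughly like the sketch below. The helper name build_post_request and the credentials are placeholders introduced for illustration, and nothing is sent over the network until the request is passed to urlopen.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_post_request(url, values):
    """Build (but do not send) a form-encoded POST request."""
    # urlencode turns the dict into "key=value&key=value"; Python 3
    # requires the request body to be bytes, hence the encode().
    data = urlencode(values).encode("utf-8")
    # Supplying data is what makes urllib issue a POST instead of a GET.
    return Request(url, data=data)


request = build_post_request(
    "https://passport.csdn.net/account/login?from=http://my.csdn.net/my/mycsdn",
    {"username": "user", "password": "XXXX"},  # placeholder credentials
)
print(request.get_method())  # POST
print(request.data)
```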

GET method:

#!/usr/bin/python
#-*- coding: utf-8 -*-

import urllib
import urllib2

values={}
values['username'] = "1016903103@qq.com"
values['password']="XXXX"
data = urllib.urlencode(values)
url = "http://passport.csdn.net/account/login"
geturl = url + "?"+data
#print geturl
request = urllib2.Request(geturl)
response = urllib2.urlopen(request)
print response.read()
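
To see what the GET version actually builds, here is a small Python 3 sketch (an assumption, since the code above targets Python 2) that constructs the same URL without sending it. It shows that urlencode percent-encodes characters such as "@", and that a request created without data is sent as a GET.

```python
from urllib.parse import parse_qs, urlencode, urlsplit
from urllib.request import Request

values = {"username": "1016903103@qq.com", "password": "XXXX"}
geturl = "http://passport.csdn.net/account/login" + "?" + urlencode(values)
print(geturl)  # the "@" appears percent-encoded as %40

request = Request(geturl)  # no data argument, so this is a GET
print(request.get_method())

# Parsing the query string back recovers the original values.
print(parse_qs(urlsplit(geturl).query))
```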

Improved Python version: reporting errors, setting Headers and a Proxy

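The urlopen function: urlopen(url, data, timeout)

The first parameter, url, is the URL to open; the second, data, is the data to send when accessing the URL; the third, timeout, sets the timeout in seconds.

The URL is required, while the second and third parameters are optional: data defaults to None and timeout defaults to socket._GLOBAL_DEFAULT_TIMEOUT. If you omit data but want to set a timeout, you must pass it by keyword (timeout=10); if data is supplied, the timeout can simply be given as the third positional argument. That is: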
```
response = urllib2.urlopen('http://www.baidu.com', timeout=10)
response = urllib2.urlopen('http://www.baidu.com',data, 10)
```
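
As a minimal sketch of the same idea on Python 3 (assumed here; urllib2.urlopen became urllib.request.urlopen), a small wrapper can always pass timeout by keyword, so it works whether or not data is supplied. The function name fetch is hypothetical.

```python
from urllib.request import urlopen


def fetch(url, data=None, timeout=10):
    """Open url (POST if data is given, GET otherwise) and return the body as bytes."""
    # Passing timeout by keyword avoids having to supply data positionally.
    with urlopen(url, data=data, timeout=timeout) as response:
        return response.read()


# html = fetch("http://www.baidu.com")             # GET, 10-second timeout
# html = fetch("http://www.baidu.com", timeout=3)  # GET, shorter timeout
```
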
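Setting Headers:

Pass a headers dict when constructing the request, and the headers are sent along with it. Servers that only answer requests that look like they come from a browser will then respond.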

Some header fields deserve special attention:
1. User-Agent: some servers or proxies use this value to decide whether the request comes from a browser.
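2. Content-Type: when calling a REST interface, the server checks this value to decide how to parse the HTTP body. Common values:
   - application/xml: used for XML RPC calls, such as RESTful/SOAP
   - application/json: used for JSON RPC calls
   - application/x-www-form-urlencoded: used when a browser submits a web form

When using a RESTful or SOAP service provided by a server, a wrong Content-Type causes the server to reject the request.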
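
As a rough illustration of the Content-Type point, the Python 3 sketch below (Python 3 is an assumption; the article's code is Python 2) builds a JSON request without sending it. The endpoint URL and payload are made up for the example.

```python
import json
from urllib.request import Request

payload = {"name": "example"}  # hypothetical request body
body = json.dumps(payload).encode("utf-8")

request = Request(
    "http://example.com/api",  # hypothetical endpoint
    data=body,
    headers={"Content-Type": "application/json"},
)
# urllib normalizes header names with str.capitalize(), so the key is
# stored as "Content-type".
print(request.get_header("Content-type"))
```
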
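Dealing with anti-hotlinking:

The server checks whether the Referer in the headers is its own site; if it is not, some servers will not respond, so we can also add a Referer to the headers: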
```
headers = { 'User-Agent' : 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'  ,'Referer':'http://www.zhihu.com/articles' }
```
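
To show where such a dict ends up, here is a short Python 3 sketch (an assumption, as above) that attaches the headers to a Request and lists what would be sent. The target URL is hypothetical.

```python
from urllib.request import Request

headers = {
    "User-Agent": "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)",
    "Referer": "http://www.zhihu.com/articles",
}
request = Request("http://www.zhihu.com/", headers=headers)  # hypothetical target URL

# header_items() returns the (name, value) pairs attached to the request.
for name, value in request.header_items():
    print(name, "=", value)
```
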
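Setting a Proxy:

By default urllib2 uses the http_proxy environment variable to set the HTTP proxy. Some sites track how many requests an IP address makes within a given period and block it if there are too many. You can set up several proxy servers and switch to a different one every so often, which makes a block less likely.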
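
One possible way to rotate proxies is sketched below for Python 3 (assumed; in Python 2 the same classes live in urllib2). The proxy addresses and the helper name next_opener are hypothetical, and a real crawler would also need to decide when to switch.

```python
import itertools
from urllib.request import ProxyHandler, build_opener

# Hypothetical proxy addresses; replace them with proxies you control.
PROXIES = [
    "http://some-proxy.com:8080",
    "http://another-proxy.com:8080",
]
_proxy_cycle = itertools.cycle(PROXIES)


def next_opener():
    """Return (opener, proxy) where opener routes HTTP traffic through proxy."""
    proxy = next(_proxy_cycle)
    return build_opener(ProxyHandler({"http": proxy})), proxy


opener, proxy = next_opener()
print(proxy)
# opener.open("http://www.baidu.com", timeout=10) would now go through that proxy.
```

Calling next_opener() again moves on to the next address in the list and wraps around at the end.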

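Fixing garbled characters:

If the page is encoded as gb2312 or gbk and shows up garbled because it is displayed as utf-8, convert the encoding in code:

html = response.read()
html = html.decode('gbk', 'ignore')  # decode the GBK bytes to unicode
html = html.encode('utf-8', 'ignore')  # encode the unicode string as UTF-8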
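
The same conversion can be tried offline. In this Python 3 sketch (an assumption; in Python 3 response.read() returns bytes and decode() yields str), a GBK-encoded byte string stands in for the downloaded page, and the helper name gbk_to_utf8 is hypothetical.

```python
def gbk_to_utf8(raw, source_encoding="gbk"):
    """Re-encode bytes from source_encoding to UTF-8, skipping undecodable bytes."""
    text = raw.decode(source_encoding, "ignore")  # bytes -> str
    return text.encode("utf-8")                   # str -> UTF-8 bytes


gbk_bytes = "网页源代码".encode("gbk")  # stand-in for response.read()
utf8_bytes = gbk_to_utf8(gbk_bytes)
# GBK uses 2 bytes per Chinese character here, UTF-8 uses 3.
print(len(gbk_bytes), len(utf8_bytes))  # 10 15
```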

Complete code:

#!/usr/bin/python
#-*- coding: utf-8 -*-
# The first line tells the system which interpreter to run; the second declares the encoding of this source file

import urllib  # import the modules
import urllib2
import cookielib

try:
    url = 'http://www.*.com/login'

    # Set up the proxy
    enable_proxy = True
    proxy_handler = urllib2.ProxyHandler({"http" : 'http://some-proxy.com:8080'})
    null_proxy_handler = urllib2.ProxyHandler({})
    if enable_proxy:
        opener = urllib2.build_opener(proxy_handler)
    else:
        opener = urllib2.build_opener(null_proxy_handler)
    urllib2.install_opener(opener)  # make this opener the default for urlopen

    # Set the headers
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'  # browser identification string
    headers = { 'User-Agent' : user_agent }

    # POST method
    values = {'username' : 'cqc',  'password' : 'XXXX' }  # POST body
    data = urllib.urlencode(values)

    # Fetch the page source
    request = urllib2.Request(url, data, headers)  # request
    response = urllib2.urlopen(request)  # response
    content = response.read()  # page content

    '''If the page is GBK-encoded:
    content = content.decode('gbk', 'ignore')  # decode the GBK bytes to unicode
    content = content.encode('utf-8', 'ignore')  # encode the unicode string as UTF-8'''

    print content
except urllib2.HTTPError, e:
    print e.code  # print the HTTP status code
except urllib2.URLError, e:
    print e.reason  # print the reason for the failure
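
The order of the two except clauses matters: HTTPError is a subclass of URLError, so it has to be caught first. A minimal Python 3 sketch of the same pattern follows (Python 3 is an assumption, hence the "except ... as e" syntax). The function name fetch and its opener parameter are hypothetical; the parameter only exists so the error paths can be exercised without a network.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, timeout=10, opener=urlopen):
    """Return (body, error); exactly one of the two is None."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.read(), None
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        return None, "HTTP error %d" % e.code
    except URLError as e:
        # No usable answer at all: DNS failure, refused connection, and so on.
        return None, "URL error: %s" % e.reason
```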
