3.1 Overview of Web Crawlers
3.1.1 Web Crawlers and Their Applications
Four categories: general-purpose, focused, incremental, and deep-web crawlers.
General-purpose crawlers: the kind used by search engines.
Focused crawlers: targeted crawling of resources on relevant pages.
Incremental crawlers: fetch only pages whose content has been updated.
Deep-web crawlers: reach web pages hidden behind surface-level links.
Practical applications of web crawlers: BT/torrent sites; cloud-drive search.
3.1.2 Structure of a Web Crawler
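The notes give no diagram here. As a sketch, the usual crawler loop (seed URL → frontier queue → download → extract links → deduplicate → back to the frontier) might look like the following; the page graph and fetch() stub are invented stand-ins for real HTTP fetching and link extraction:

```python
from collections import deque

# A toy page graph standing in for the web; in a real crawler,
# fetch() would issue an HTTP request and parse links out of the HTML.
PAGES = {
    'http://example.com/':  ['http://example.com/a', 'http://example.com/b'],
    'http://example.com/a': ['http://example.com/b'],
    'http://example.com/b': [],
}

def fetch(url):
    """Stub fetcher: returns the outgoing links of a page."""
    return PAGES.get(url, [])

def crawl(seed):
    seen = set()              # dedup store: URLs already processed
    frontier = deque([seed])  # URL queue (the "frontier")
    order = []
    while frontier:
        url = frontier.popleft()
        if url in seen:
            continue
        seen.add(url)
        order.append(url)            # "download and store" step
        for link in fetch(url):      # "extract new URLs" step
            if link not in seen:
                frontier.append(link)
    return order

print(crawl('http://example.com/'))
```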
3.2 Implementing HTTP Requests in Python
Three approaches: urllib2/urllib, httplib/urllib, and Requests.
3.2.1 The urllib2/urllib Implementation
1. Sending a request to a given URL:
import urllib2
response = urllib2.urlopen('http://www.zhihu.com')
html = response.read()
print(html)
The same call, split into explicit request and response steps:
import urllib2
# build the request
request = urllib2.Request('http://www.zhihu.com')
# fetch the response
response = urllib2.urlopen(request)
html = response.read()
print(html)
POST request: attaching request data
import urllib
import urllib2
url = 'http://www.zhihu.com'
postdata = {'username': 'qiye', 'password': 'qiye-pass'}
data = urllib.urlencode(postdata)
req = urllib2.Request(url, data)
response = urllib2.urlopen(req)
html = response.read()
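urllib2 exists only in Python 2; in Python 3 the same POST is built from urllib.request and urllib.parse. A sketch using the placeholder URL and credentials from the notes (the request is constructed but not actually sent):

```python
from urllib import parse, request

url = 'http://www.zhihu.com'
postdata = {'username': 'qiye', 'password': 'qiye-pass'}
# In Python 3 the encoded form data must be bytes, not str.
data = parse.urlencode(postdata).encode('utf-8')

# A Request that carries data is sent as a POST.
req = request.Request(url, data=data)
print(req.get_method())  # 'POST'
print(req.data)
# response = request.urlopen(req)  # would actually send the request
```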
2. Handling request headers
import urllib
import urllib2
url = 'http://www.zhihu.com'
user_agent = '...'
referer = '...'
postdata = {...}
# put the user agent and referer into the headers
headers = {'User-Agent': user_agent, 'Referer': referer}
data = urllib.urlencode(postdata)
req = urllib2.Request(url, data, headers)
response = urllib2.urlopen(req)
html = response.read()
3. Cookie handling
Reading the value of each cookie item:
import urllib2
import cookielib
cookie = cookielib.CookieJar()
# build an opener that routes responses through the cookie jar
opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
response = opener.open('...')
for item in cookie:
    print(item.name + ':' + item.value)
# adding your own cookie content
import urllib2
opener = urllib2.build_opener()
opener.addheaders.append(('Cookie','email='+"..."))
req = urllib2.Request("...")
response = opener.open(req)
print(response.headers)
retdata = response.read()
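Both techniques carry over to Python 3 via http.cookiejar and urllib.request. A runnable sketch against a throwaway local server (the server, the cookie name `session`, and the email value are invented for illustration):

```python
import threading
from http import cookiejar
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Echo back the client's Cookie header and set a cookie of our own.
        sent = self.headers.get('Cookie', '')
        self.send_response(200)
        self.send_header('Set-Cookie', 'session=abc123')
        self.end_headers()
        self.wfile.write(sent.encode('utf-8'))
    def log_message(self, *args):
        pass  # keep output quiet

server = HTTPServer(('127.0.0.1', 0), Handler)
url = 'http://127.0.0.1:%d/' % server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

# 1) Let a CookieJar capture cookies the server sets.
jar = cookiejar.CookieJar()
opener = request.build_opener(request.HTTPCookieProcessor(jar))
opener.open(url).read()
names = [item.name for item in jar]
print(names)

# 2) Attach a Cookie header by hand, as opener.addheaders does above.
opener2 = request.build_opener()
opener2.addheaders.append(('Cookie', 'email=qiye@example.com'))
echoed = opener2.open(url).read().decode('utf-8')
print(echoed)

server.shutdown()
```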
4. Setting a request timeout (Timeout)
import urllib2
request = urllib2.Request('...')
response = urllib2.urlopen(request,timeout=2)
html = response.read()
print(html)
5. Getting the HTTP response code
import urllib2
try:
    response = urllib2.urlopen('...')
    print(response)
except urllib2.HTTPError as e:
    if hasattr(e, 'code'):
        print('Error code:', e.code)
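The same pattern in Python 3 uses urllib.error.HTTPError. A sketch checked against a throwaway local server that always answers 404, so it runs without real network access:

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import error, request

class NotFound(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_error(404)   # every request gets a 404
    def log_message(self, *args):
        pass  # keep output quiet

server = HTTPServer(('127.0.0.1', 0), NotFound)
url = 'http://127.0.0.1:%d/missing' % server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

try:
    request.urlopen(url)
except error.HTTPError as e:
    if hasattr(e, 'code'):
        code = e.code
        print('Error code:', code)  # Error code: 404

server.shutdown()
```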
6. Redirects
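The notes leave this item empty. By default urllib2 (and urllib.request in Python 3) follows 3xx responses automatically, and geturl() reveals the final URL after redirection. A sketch against a throwaway local redirecting server (the paths /old and /new are invented):

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib import request

class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == '/old':
            self.send_response(302)           # temporary redirect
            self.send_header('Location', '/new')
            self.end_headers()
        else:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b'landed')
    def log_message(self, *args):
        pass  # keep output quiet

server = HTTPServer(('127.0.0.1', 0), Redirector)
base = 'http://127.0.0.1:%d' % server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

response = request.urlopen(base + '/old')
final_url = response.geturl()   # the redirect has already been followed
body = response.read()
print(final_url, body)

server.shutdown()
```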
7. Setting a proxy
import urllib2
proxy = urllib2.ProxyHandler({'http': '127.0.0.1:8087'})
opener = urllib2.build_opener(proxy)
urllib2.install_opener(opener)
response = urllib2.urlopen('...')
print(response.read())
3.2.2 The httplib/urllib Implementation
httplib is a low-level module that exposes every step of an HTTP request. It is rarely needed in crawler development; it is covered here only as background knowledge:
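As a taste of that low-level view, httplib (renamed http.client in Python 3) splits a request into explicit connect/send/read steps. A sketch against a throwaway local server so it runs offline:

```python
import threading
from http import client
from http.server import BaseHTTPRequestHandler, HTTPServer

class Hello(BaseHTTPRequestHandler):
    def do_GET(self):
        self.send_response(200)
        self.send_header('Content-Type', 'text/plain')
        self.end_headers()
        self.wfile.write(b'hello')
    def log_message(self, *args):
        pass  # keep output quiet

server = HTTPServer(('127.0.0.1', 0), Hello)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = client.HTTPConnection('127.0.0.1', port)  # step 1: open the connection
conn.request('GET', '/')                         # step 2: send request line + headers
resp = conn.getresponse()                        # step 3: read status line + headers
status = resp.status
body = resp.read()                               # step 4: read the body
print(status, body)

conn.close()
server.shutdown()
```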
3.2.3 The More User-Friendly Requests
1. The complete request/response model
GET:
import requests
r = requests.get('...')
print(r.content)
POST:
import requests
postdata = {...}
r = requests.post('...',data=postdata)
print(r.content)
2. Response content and encoding
import requests
r = requests.get('...')
print('content-->>'+r.content)
print('text-->>'+r.text)
print('encoding-->>'+r.encoding)
r.encoding = 'utf-8'
print('new text -->>'+r.text)
chardet is a module for detecting the encoding of strings/files.
Assign the encoding chardet detects directly to r.encoding, and r.text will then decode without garbled output:
import requests
import chardet
r = requests.get('...')
print(chardet.detect(r.content))
r.encoding = chardet.detect(r.content)['encoding']
print(r.text)
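chardet is a third-party package; this stdlib-only sketch shows the problem it solves: decoding response bytes with the wrong charset garbles the text, while re-decoding with the right one repairs it (the sample string is invented):

```python
# Pretend these bytes are r.content fetched from a UTF-8 page.
raw = '你好,世界'.encode('utf-8')

wrong = raw.decode('latin-1')  # what r.text shows under a bad encoding guess
right = raw.decode('utf-8')    # what you get after fixing r.encoding
print(wrong)
print(right)
```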
3. Handling request headers
import requests
user_agent = '...'
headers = {'User-Agent':user_agent}
r = requests.get('...',headers = headers)
print(r.content)
4. Handling the response code and response headers
Response code: the status_code field
Response headers: the headers field
import requests
r = requests.get('...')
if r.status_code == requests.codes.OK:
    print(r.status_code)                  # response code
    print(r.headers)                      # full response headers
    print(r.headers.get('content-type'))  # a single header field
else:
    r.raise_for_status()                  # raise the error explicitly
5. Cookie handling
Reading cookie fields from the response:
import requests
user_agent = '...'
headers = {'User-Agent': user_agent}
r = requests.get('...',headers = headers)
for cookie in r.cookies.keys():
    print(cookie + ':' + r.cookies.get(cookie))
Sending a custom cookie:
import requests
user_agent = '...'
headers = {'User-Agent': user_agent}
cookies = dict(name='qiye',age='10')
r = requests.get('...',headers = headers,cookies = cookies)
print(r.text)
Requests also provides a Session object that manages cookies for the program automatically:
import requests
loginurl = '...'
s = requests.Session()
# first visit the login page as a guest; the server assigns a cookie
r = s.get(loginurl,allow_redirects=True)
datas = {'name': 'qiye', 'passwd': 'qiye'}
# POST to the login URL; once verified, the guest cookie gains member privileges
r = s.post(loginurl, data=datas, allow_redirects=True)
print(r.text)
6. Redirects and history
Handle redirects with the allow_redirects parameter.
Inspect the redirect chain with r.history.
import requests
r = requests.get('...')
print(r.url)
print(r.status_code)
print(r.history)
7. Setting a timeout
requests.get('...',timeout=2)
8. Setting proxies
import requests
proxies = {"http": "....", "https": "......"}
requests.get("...",proxies = proxies)