Spider (Requests)

(1) Basic theory

Components of an HTTP request (HttpRequest)
1. Request method (Method)
    -GET (parameters visible in the URL) | POST (submitted form data not visible in the URL)
    -GET URL length is limited in practice (the exact limit varies by browser and server) | POST body size is effectively unlimited
2. Request URL (Uniform Resource Locator)
    -URLs and URNs (Uniform Resource Names) are both kinds of URI (Uniform Resource Identifier)
    -every URL is a URI, but not every URI is a URL
3. Request headers (RequestHeader)
    -Accept (media types the client can accept)
    -Accept-Language (language types) + Accept-Encoding (content encodings)
    -Host (host name and port of the requested resource)
    -Cookie (data stored locally to identify the user; used to maintain the current session)
    -Referer (identifies which page the request came from; used for anti-hotlinking)
    -User-Agent (UA: used to disguise the client as a browser)
    -Content-Type (Internet media type of the request body)
4. Request body
    -empty for GET | form data for POST
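The four components above can be inspected without touching the network by preparing (but not sending) a request with requests' own Request/PreparedRequest objects. A minimal sketch; the URL and User-Agent string are placeholders:

```python
import requests

# Build a request without sending it, then inspect the four parts
# listed above: method, URL, headers, and body.
req = requests.Request(
    method='POST',
    url='http://httpbin.org/post',          # placeholder URL
    headers={'User-Agent': 'demo-agent'},   # placeholder UA string
    data={'name': 'Python'},
)
prepared = req.prepare()

print(prepared.method)                   # POST
print(prepared.url)                      # http://httpbin.org/post
print(prepared.headers['Content-Type'])  # application/x-www-form-urlencoded
print(prepared.body)                     # name=Python
```

Note that requests fills in Content-Type automatically when `data` is a dict, which is why the prepared body is URL-encoded form data.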

Components of an HTTP response (HttpResponse)
1. Status code (status-code)
    -200 (OK) | 404 (page not found) | 500 (server error)
2. Response headers (ResponseHeader)
    -Date + Last-Modified + Content-Encoding + Server + Expires
    -Set-Cookie
3. Response body (payload)
    -web page (HTML code)
    -image | video | audio (binary data)
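The standard reason phrases for the status codes above can be checked against Python's standard library, with no network access at all:

```python
from http import HTTPStatus

# Map the codes listed above to their standard reason phrases.
for code in (200, 404, 500):
    print(code, HTTPStatus(code).phrase)
# 200 OK
# 404 Not Found
# 500 Internal Server Error
```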

How a crawler works
1. Fetch the page (urllib + requests)
2. Extract information (XPath + BeautifulSoup + pyquery + lxml)
3. Save the data (TXT | JSON + databases [MySQL | MongoDB])
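The three steps can be sketched end to end. So the sketch runs offline, step 1 is stood in by a hard-coded page (with a network it would be `requests.get(url).text`), and step 2 uses a regex instead of BeautifulSoup/lxml to stay dependency-free:

```python
import json
import re

# Step 1: fetch the page (hard-coded here so the sketch runs offline).
html = ('<html><head><title>Demo</title></head>'
        '<body><a href="/a">A</a><a href="/b">B</a></body></html>')

# Step 2: extract information (a regex here; XPath/BeautifulSoup in practice).
title = re.search(r'<title>(.*?)</title>', html).group(1)
links = re.findall(r'href="(.*?)"', html)

# Step 3: save the data as JSON.
record = {'title': title, 'links': links}
with open('page.json', 'w', encoding='utf-8') as f:
    json.dump(record, f, ensure_ascii=False)
print(record)  # {'title': 'Demo', 'links': ['/a', '/b']}
```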

(2) Using the Requests module

import requests

headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'}

# Simulate a GET request
response = requests.get(url='http://httpbin.org/get', headers=headers)
if response.status_code == 200:
    print('As text:\n', response.text)
    print('As JSON:\n', response.json())
else:
    print('failed to load the page')

# Pass GET parameters (requests appends the query string itself,
# so the URL needs no trailing '?')
params = {
    'name': 'Python', 'age': 25
}
try:
    response = requests.get(url='http://httpbin.org/get', params=params, headers=headers)
    if response.status_code == 200:
        print('URL:', response.url)
except requests.RequestException:
    print('request failed')

# Save binary data
pic_url = 'https://github.com/favicon.ico'
try:
    response = requests.get(url=pic_url, headers=headers)
    if response.status_code == 200:
        with open('icon.ico', 'wb') as f:
            f.write(response.content)
except requests.RequestException:
    print('request failed')


# Simulate a POST request
data = {
    'name': 'Jack', 'age': 15, 'hobby': 'sing'
}
response = requests.post(url='https://httpbin.org/post', data=data, headers=headers)
if response.status_code == 200:
    print('URL:', response.url)
    print('Response headers:\n', response.headers)
    print('Cookies:\n', response.cookies)
    print('Response body:\n', response.text)
else:
    print('failed to load the page')

#File upload
file = {'file':open('icon.ico','rb')}
response = requests.post(url='https://httpbin.org/post',files=file)
if response.status_code == 200:
    print(response.json()['files']['file'])

# Skip SSL certificate verification
# verify=False disables certificate checks; silence the resulting
# InsecureRequestWarning from urllib3.
import urllib3
urllib3.disable_warnings()

header = {
    'Host': 'www.12306.cn',
    'Referer': 'http://www.12306.cn/mormhweb/',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36'
}
response = requests.get(url='https://www.12306.cn/mormhweb/', headers=header, verify=False, timeout=10)
print(response.status_code)

# Inspect the cookies a site sets
url = 'https://www.baidu.com/s'
params = {'wd': 'python'}
try:
    response = requests.get(url=url, params=params, headers=headers)
    if response.status_code == 200:
        print('URL:', response.url)
        print('Cookies type:', type(response.cookies))
        for key, value in response.cookies.items():
            print(key, '=', value)
except requests.RequestException:
    print('request failed')

# Reuse cookies copied from a logged-in browser session to access a page that requires login
cookie = 'd_c0="AKDmVSVBzA2PTjynnetTsX-qsL0bV8oyHNo=|1529825585"; ' \
         'q_c1=53c28e20a1944febaf790d40c87f4433|1529825585000|1529825585000; ' \
         '_zap=d4a8658c-aa0b-406d-bb05-0bdc59a586d9; ' \
         '__utma=155987696.193870559.1530005357.1530005357.1530005357.1; ' \
         '__utmz=155987696.1530005357.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); ' \
         '_xsrf=z6DuvIMdhU9v8kRZPx24qQggceNtH8Bo; tgw_l7_route=170010e948f1b2a2d4c7f3737c85e98c; ' \
         'capsion_ticket="2|1:0|10:1539067641|14:capsion_ticket|44:NzViODI0YmM1ZjgxNDc3NmJmZjI4MzJjMjJjNzc1OTI=|e89ccce191c0d6b410485fcf58eef346823f458547a9083fa157ddbd3bd7d93e"; ' \
         'z_c0="2|1:0|10:1539067676|4:z_c0|92:Mi4xLThWVUJnQUFBQUFBb09aVkpVSE1EU1lBQUFCZ0FsVk5ISjJwWEFDdTNIV0c1UjRnRWRRem5TaVozSzVpX3NJVWJR|fed207931e835dfffa6c4babb0f17ae744e18ec4814499a7e69f14a2d8c2b5ad"'

header = {
    'authority':'www.zhihu.com',
    'Cookie':cookie,
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.87 Safari/537.36',
    'Referer': 'https://www.zhihu.com/signup?next=%2F'
}
try:
    response = requests.get(url='https://www.zhihu.com', headers=header)
    print('URL:', response.url)
    print(response.text)
except requests.RequestException:
    print('request failed')

# Session persistence
# Two plain requests.get calls are independent, so the cookie set by
# the first call is not sent with the second:
requests.get(url='http://httpbin.org/cookies/set/number/12306')
response = requests.get(url='http://httpbin.org/cookies')
print(response.text)
'''
{"cookies": {}}
'''
# A Session keeps cookies across requests:
session = requests.Session()
session.get(url='http://httpbin.org/cookies/set/number/12306')
response = session.get(url='http://httpbin.org/cookies')
print(response.text)
'''
{"cookies": {"number": "12306"}}
'''
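A Session is also where connection-level policy hangs. As a sketch (the retry numbers are arbitrary, not a recommendation), requests' HTTPAdapter plus urllib3's Retry can make a Session retry transient failures automatically:

```python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

# Retry up to 3 times, backing off between attempts, on common
# transient server errors. The numbers are illustrative.
retry = Retry(total=3, backoff_factor=0.5, status_forcelist=[500, 502, 503])
adapter = HTTPAdapter(max_retries=retry)

session = requests.Session()
session.mount('http://', adapter)
session.mount('https://', adapter)

# Every request made through this session now gets the retry policy.
print(session.get_adapter('http://httpbin.org/get').max_retries.total)  # 3
```

Mounting the adapter on both the `http://` and `https://` prefixes applies the policy to all URLs the session touches.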

Reposted from blog.csdn.net/qinlan1994/article/details/82983448