python爬虫之requests库

requests库介绍

发送http请求的第三方库，兼容python2和python3

安装：

pip install requests

使用：

import requests
发送请求
response = requests.get(url)
response = requests.post(url)

响应内容
请求返回的值是一个response对象，是对http协议服务端返回数据的封装
response对象主要属性和方法:

response.status_code	返回码
response.headers	返回的头信息，字典类型
response.content	响应的原始数据字节类型，图片、音频、视频一般用这种
response.encoding	text数据转码格式，先设置encoding，然后再取出text，就解决了乱码问题
response.text	响应的网页源代码，数据经过转码的字符串
response.cookies	服务器返回的cookies
response.json()	当结果为json格式数据时，把它转成字典

response = requests.get('http://www.baidu.com')
print(response.status_code) #200
print(response.headers) # 服务器返回的头信息
print(response.content) # 原始数据，字节类型
print(response.content.decode()) # 网页源码 已转码
print(response.text) # 网页源码 因转码方式为iso-8859,中文乱码  当返回的头信息中的content-type 有charset属性时，
                     # 转码按照charset的值来，如果没有charset而有text类型，则按照iso-8859来
response.encoding = 'utf-8'
print(response.text) # 网页源码 把转码方式设为utf-8,解决中文乱码
print(response.cookies)

查询参数

get请求对url进行传参（url拼接）

import requests
payload = {'wd':'python'}
response = requests.get('http://www.baidu.com/s?',params=payload)
response.encoding = 'utf-8'
print(response.text)
print(response.url) #打印最终请求的url http://www.baidu.com/s?wd=python

post请求提交参数

import requests
data = {'user':'qqq'} #参数
response = requests.post('http://httpbin.org/post',data=data)
response.encoding = 'utf-8'
print(response.text)

超时设置

import requests
response = requests.get('https://www.google.com',timeout=5) #5秒后还没有应答，就会报错超时，后续可以进行异常处理

cookies处理

比如登录页面之后，把cookies保存起来，然后在后续请求中，把cookies传入

data = {
    'account_name': 'asda',
    'password':'qwe123'
}
result = requests.post('https://qiye.163.com/login/',data=data)
if result.status_code == 200:
    cookies = result.cookies
    response = requests.get('https://qiye.163.com/News',cookies=cookies)

这样，带上登录后的cookies的请求，就可以正常地访问登录后数据了

session

为了维持客户端和服务端的通信状态

session= requests.session()
session.get()  #session对象的api和requests基本一样 并且用session请求，会自动保存cookies，并且下次请求会自己带上，方便

SSL认证：verify=False

requests.get('https://www.12306.com',verify=False)   #verify默认为True，当设置为False时，不验证证书

headers头信息

headers = {
'User=Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
}
r = requests.get('https://www.zhihu.com',headers=headers,verify=False) #添加头信息发送请求,不添加会被知乎拒绝访问

重定向：allow_redirects=False

headers = {
'User=Agent':'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
}
r = requests.get('https://www.zhihu.com',headers=headers,verify=False,allow_redirects=False) #关闭重定向

设置代理

proxies = {'http':'183.232.188.18:80','https':'183.232.188.18:80'}
r = requests.get(url='www.baidu.com',proxies=proxies) #使用代理进行请求

转换json格式数据

r = requests.get('http://httpbin.org/ip')
print(r.json()) #当返回的数据是json格式时，可以直接通过json()方法把json格式的数据转成字典

实例:用requests模拟github登录

'''
思路：github登录需要携带首页的cookies，并且设置头信息中的UA，而且post表单中有一个token参数需要请求首页才能得到
'''
import re
import requests
import urllib3
urllib3.disable_warnings()  # 取消警告


def get_params():
    start_url = 'https://github.com/login' # 从login页面获取cookies和token参数
    response = requests.get(start_url,verify=False)  # 关闭ssl验证
    cookies = response.cookies
    # print(response.text)
    token = re.findall(r'<input type="hidden" name="authenticity_token" value="(.*?)" /> ',response.text)[0] # 正则取出token
    return cookies,token


def login():
    post_url = 'https://github.com/session' #真正登录提交数据的页面
    cookies,token = get_params()
    # headers里面注意要有referer，表明是从该链接过来的，防盗链
    headers = {
        'Host': 'github.com',
        'Referer': 'https://github.com/login',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36',
        'Accept-Encoding': 'gzip, deflate, br',
    }
    # data是通过抓包获取的，是登录时提交的表单参数
    data = {
        'commit': 'Sign in',
        'utf8': '✓',
        'authenticity_token': token,
        'login': 'xxxxxx',
        'password': 'xxxxxxxx',
    }
    r = requests.post(url=post_url,data=data,headers=headers,cookies=cookies,verify=False)
    print(r.text)


if __name__ == '__main__':
    login()

最后在输出的文本中搜索一下 Start a project(我们在浏览器进入github，首页里有这个)

搜索到说明登录成功了！