[Python] Web Scraping 02: The requests Module

I. Overview of requests

  • A "non-GMO" HTTP library for Python
  • Sends HTTP requests and retrieves the response data
  • Install with pip install requests (or pip3 install requests)

II. Sending GET Requests

1. Sending a GET request with requests

import requests

url = "https://www.baidu.com"

# Send the GET request
response = requests.get(url)

# response.text is the response body decoded to str
print(response.text)


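2. The response object

  • response.text

Type: str
Decoding: requests makes an educated guess at the encoding from the HTTP headers and decodes the body with that guess

  • response.content

Type: bytes
Decoding: none; these are the raw bytes of the body

2.1 Fixing garbled Chinese text

Decode response.content yourself to fix garbled Chinese text.
response.content.decode() defaults to UTF-8.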

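import requests

url = "https://www.baidu.com"

# Send the GET request
response = requests.get(url)
# response.encoding only changes how response.text is decoded
response.encoding = 'utf8'

# Raw bytes first, then the same bytes decoded as UTF-8
print(response.content)
print(response.content.decode())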
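
Why the text comes out garbled can be shown without touching the network. When a server sends a text/* Content-Type with no charset, requests falls back to ISO-8859-1 for response.text, so UTF-8 bytes are decoded with the wrong codec. The sketch below imitates that with plain bytes; the sample string is just an illustration, not the actual Baidu page.

```python
# Stand-in for response.content: UTF-8 bytes as a server would send them
original = "百度一下,你就知道"
raw = original.encode("utf-8")

# Decoding with the wrong codec produces mojibake instead of raising an error
garbled = raw.decode("iso-8859-1")

# Decoding with the right codec recovers the text (decode() defaults to UTF-8)
fixed = raw.decode()

# Text that was already garbled this way can be repaired by reversing the bad step
repaired = garbled.encode("iso-8859-1").decode("utf-8")

print(garbled == original)   # False
print(fixed == original)     # True
print(repaired == original)  # True
```

This is the same thing response.content.decode() does for you: it skips the guessed encoding and decodes the raw bytes directly.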


2.2 Attributes and methods of the response object

import requests

url = "https://www.baidu.com"

# Send the GET request
response = requests.get(url)
response.encoding = 'utf8'

# URL of the response
print(response.url)
# https://www.baidu.com/

# Status code
print(response.status_code) # 200

# Headers of the request that was sent
print(response.request.headers)
# {'User-Agent': 'python-requests/2.25.1',
#  'Accept-Encoding': 'gzip, deflate', 'Accept': '*/*',
#  'Connection': 'keep-alive'}

# Response headers
print(response.headers)

# {'Cache-Control': 'private, no-cache, no-store,
#  proxy-revalidate, no-transform', 'Connection': 'keep-alive',
# 'Content-Encoding': 'gzip', 'Content-Type': 'text/html',
# 'Date': 'Mon, 01 Nov 2021 13:51:59 GMT',
# 'Last-Modified': 'Mon, 23 Jan 2017 13:24:18 GMT',
# 'Pragma': 'no-cache', 'Server': 'bfe/1.0.8.18',
# 'Set-Cookie': 'BDORZ=27315; max-age=86400;
# domain=.baidu.com; path=/', 'Transfer-Encoding': 'chunked'}


# Print the cookies set by the response
print(response.cookies)
# <RequestsCookieJar[<Cookie BDORZ=27315 for .baidu.com/>]>

3. Sending a request with custom headers

requests.get(url, headers=headers)
  • The headers parameter takes the request headers as a dictionary
  • Each key is a header field name and each value is that field's value

import requests

url = 'https://www.baidu.com'

# Build the custom request headers
headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.40'
}
# Send the request disguised as a browser (a wolf in sheep's clothing)
response = requests.get(url, headers=headers)
print(response.request.headers)
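
To see what the headers argument changes without sending anything, you can prepare the request and inspect it. This is a minimal sketch: it builds the request through a Session so that the library's default headers are merged in, and the shortened User-Agent string is only an example value.

```python
import requests

URL = "https://www.baidu.com"
HEADERS = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}

session = requests.Session()

# prepare_request() merges the session defaults with the per-request headers
default_request = session.prepare_request(requests.Request("GET", URL))
custom_request = session.prepare_request(
    requests.Request("GET", URL, headers=HEADERS)
)

# The default User-Agent identifies the script as python-requests
print(default_request.headers["User-Agent"])
# The custom value replaces it; the other default headers are kept
print(custom_request.headers["User-Agent"])
print(sorted(custom_request.headers))
```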

III. Cookies

1. Sending a request that carries cookies


If the title contains ".github", the account login did not succeed.

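import requests

url = 'https://github.com/M1kaelson'

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0'
}
# Paste the Cookie header value copied from a logged-in browser session here
temp = ''
# Re-encode so that non-ASCII characters can be sent in a latin-1 header
temp = temp.encode("utf-8").decode("latin-1")
cookie_list = temp.split(';')
cookies = {}

for cookie in cookie_list:
    cookie = cookie.strip()
    if '=' not in cookie:
        continue
    # cookies is a dict that maps each cookie name to its value;
    # split on the first '=' only, because a value may itself contain '='
    name, value = cookie.split('=', 1)
    cookies[name] = value
print(cookies)

response = requests.get(url, headers=headers, cookies=cookies, timeout=60)

with open("github_with_cookies_.html", "wb") as f:
    f.write(response.content)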

(The earlier attempts failed because of the campus network; over 4G it worked right away.)
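
The cookie handling above can be checked offline. The sketch below wraps the parsing in a helper (parse_cookie_string is a name made up for this example) and prepares a request to confirm that the cookies dict ends up in the Cookie header. The cookie names and values are invented, not real GitHub cookies.

```python
import requests


def parse_cookie_string(cookie_string):
    """Turn 'a=1; b=2' (a copied Cookie header value) into a dict."""
    cookies = {}
    for pair in cookie_string.split(";"):
        pair = pair.strip()
        if "=" not in pair:
            continue  # skip empty or malformed fragments
        # Split on the first '=' only: values such as base64 may contain '='
        name, value = pair.split("=", 1)
        cookies[name] = value
    return cookies


sample = "logged_in=yes; dotcom_user=someone; token=abc=="
cookies = parse_cookie_string(sample)
print(cookies)

# Prepare (but do not send) a request to see the resulting Cookie header
prepared = requests.Request(
    "GET", "https://github.com/", cookies=cookies
).prepare()
print(prepared.headers["Cookie"])
```

Splitting on the first "=" matters for values that end in "=" padding; taking the last piece of split('=') would silently drop them.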

2. The cookiejar object

  • Build a cookies dictionary
  • cookies = {"name": "value"}

import requests

url = ''

headers = {
    'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0'
}
# Paste the Cookie header value copied from the browser here
temp = ''

cookie_list = temp.split(';')
cookies = {}

for cookie in cookie_list:
    cookie = cookie.strip()
    if '=' not in cookie:
        continue
    # cookies is a dict that maps each cookie name to its value
    name, value = cookie.split('=', 1)
    cookies[name] = value
print(cookies)

response = requests.get(url, headers=headers, cookies=cookies)

with open("github_with_cookies_.html", "wb") as f:
    f.write(response.content)
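
response.cookies is a RequestsCookieJar rather than a plain dict. As a small sketch of moving between the two forms, requests.utils provides cookiejar_from_dict and dict_from_cookiejar; the cookie name and value below are placeholders.

```python
from requests.utils import cookiejar_from_dict, dict_from_cookiejar

# dict -> cookiejar (the same kind of object as response.cookies)
jar = cookiejar_from_dict({"name": "value"})

# The jar supports dict-style lookup by cookie name
print(jar["name"])

# cookiejar -> dict, e.g. to save the cookies or pass them to another request
as_dict = dict_from_cookiejar(jar)
print(as_dict)
```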

IV. Parameters

1. timeout

# Raise an exception if the server does not respond within 3 seconds
response = requests.get(url, timeout=3)
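
Without a timeout, requests waits indefinitely, and with one, a slow server raises an exception that the script should handle. The following is a minimal sketch; fetch_or_none is a helper name invented for this example.

```python
import requests


def fetch_or_none(url, timeout=3):
    """Return the response, or None if the request failed or timed out."""
    try:
        # timeout may also be a (connect, read) tuple such as (3, 10)
        response = requests.get(url, timeout=timeout)
        response.raise_for_status()
    except requests.exceptions.Timeout:
        print("timed out:", url)
        return None
    except requests.exceptions.RequestException as exc:
        # Covers connection errors, bad status codes, malformed URLs, ...
        print("request failed:", exc)
        return None
    return response


# A URL without a scheme fails before any network traffic happens
print(fetch_or_none("not-a-url"))
```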

2. proxy


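  • Forward proxy
  • Reverse proxy (for example nginx; the browser does not know the server's real address)

3. Protocols

  • http, https
  • socks

SOCKS simply relays data packets and does not care about the application-layer protocol.
A SOCKS proxy takes less time than an HTTP or HTTPS proxy.
A SOCKS proxy can forward HTTP and HTTPS requests.

response = requests.get(url, proxies=proxies)
# proxies takes the form of a dictionary
import requests

url = 'http://www.google.com'
proxies = {
    # Replace ip:port with the address of a real proxy
    'http': 'http://ip:port'
}

response = requests.get(url, proxies=proxies)
print(response.text)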
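
The keys of the proxies dict are the scheme of the target URL, so an https:// URL needs an 'https' entry as well. Here is a small sketch of building the dict for both; build_proxies is a hypothetical helper and 127.0.0.1:1080 is a placeholder address, not a working proxy.

```python
def build_proxies(host, port, scheme="http"):
    """Build a proxies dict that routes both http and https URLs via one proxy."""
    proxy_url = f"{scheme}://{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


# Plain HTTP proxy
print(build_proxies("127.0.0.1", 1080))

# SOCKS5 proxy; requests needs the extra dependency for this:
#   pip install requests[socks]
print(build_proxies("127.0.0.1", 1080, scheme="socks5"))
```

The result is passed as requests.get(url, proxies=build_proxies(...)).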

4. verify

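  • Use verify to skip CA certificate checks
  • To let the request succeed from code, pass verify=False; requests then sends the request without verifying the CA certificate
response = requests.get(url, verify=False)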
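
With verify=False, urllib3 emits an InsecureRequestWarning on every request. The sketch below sets the option once on a session and silences that warning; make_insecure_session is a name made up here, and this is only reasonable for sites you trust.

```python
import requests
import urllib3


def make_insecure_session():
    """Return a session that skips certificate verification for every request."""
    # Silence the warning urllib3 prints for unverified HTTPS requests
    urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)
    session = requests.Session()
    session.verify = False
    return session


session = make_insecure_session()
print(session.verify)
```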

V. Sending POST Requests

  • response = requests.post(url, data=data)
  • The data parameter takes a dictionary
  • The other parameters are the same as for get
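
What happens to the data dictionary can be inspected by preparing a POST request without sending it. This is a rough sketch; the example.com URL is a hypothetical endpoint, and the form fields mirror the translation example used later in this section.

```python
import requests
from urllib.parse import parse_qs

prepared = requests.Request(
    "POST",
    "http://example.com/translate",  # hypothetical endpoint
    data={"from": "zh", "to": "en", "q": "字典"},
).prepare()

# A dict passed as data is sent as a URL-encoded form body
print(prepared.headers["Content-Type"])
print(prepared.body)

# Decoding the body gives back the original fields
decoded = parse_qs(prepared.body)
```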

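1. Sending a POST request to translate words with Kingsoft (iCIBA)

# url
# headers
# data dictionary
# send the request and get the response
# parse the data

1.1 Capture traffic to find the request URL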


1.2 Identify the request parameters

I also tried Youdao Dictionary; it appears to use anti-scraping measures, and I could not get it to work.

1.3 Locate the data in the response

1.4 Fetch the data by imitating a browser

It finally worked.

#coding:utf-8
import requests


class King(object):

    def __init__(self, word):
        # url
        self.url = "http://ifanyi.iciba.com/index.php?c=trans&m=fy&client=6&auth_user=key_ciba&sign=37218aa29f55fdcc"
        # headers
        self.headers = {
            'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:93.0) Gecko/20100101 Firefox/93.0'
        }
        # data: translate word from Chinese (zh) to English (en)
        self.data = {
            "from": "zh",
            "to": "en",
            "q": word
        }

    def get_data(self):
        # Post the form data, then turn the \uXXXX escapes in the body into readable text
        response = requests.post(self.url, data=self.data, headers=self.headers)
        return response.content.decode('unicode-escape')

    def run(self):
        response = self.get_data()
        print(response)


if __name__ == '__main__':
    # Use a lowercase name so the instance does not shadow the King class
    king = King('字典')
    king.run()

    # with open("fanyi.html", "wb")as f:
    #     f.write(Youdao.run())
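
For the "parse the data" step, an alternative to decode('unicode-escape') is json.loads, which resolves \uXXXX escapes and returns Python objects in one go. The payload below is a made-up sample in the same escaped style; the real response fields of the translation API are not shown in this article, so the key names are assumptions.

```python
import json

# Made-up response body with escaped Chinese characters (not the real API output)
body = b'{"status": 1, "q": "\\u5b57\\u5178", "out": "dictionary"}'

# json.loads accepts bytes and decodes the \uXXXX escapes itself
parsed = json.loads(body)

print(parsed["out"])
print(parsed["q"] == "字典")  # True
```

With a live response, response.json() does the same parsing.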

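2. Where the request data comes from

  1. Fixed values: compare several captures to find the values that never change
  2. Input values
  3. Preset values: read from static files
  4. Preset values: obtained by sending a request
  5. Generated on the client: analyze the JS and reproduce the generated data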
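
For the last case, once the JS has been read, the same computation is rewritten in Python. The sketch below is purely illustrative: the scheme (MD5 of a salt plus the word, truncated to 16 hex digits) and the salt are invented for this example and are not how any particular site computes its sign parameter.

```python
import hashlib


def make_sign(word, salt="hypothetical-salt"):
    """Return a made-up signature: the first 16 hex digits of an MD5 digest."""
    raw = (salt + word).encode("utf-8")
    return hashlib.md5(raw).hexdigest()[:16]


data = {"from": "zh", "to": "en", "q": "hello"}
# The generated value would be sent along with the fixed and input values
params = {"sign": make_sign(data["q"])}
print(params)
```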

3. requests.session

  • Handles cookies automatically
  • The next request carries the cookies from the previous one
  • Used for a series of consecutive requests
session = requests.session()
response = session.get(url, headers=headers)
response = session.post(url, data=data)
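
The cookie carry-over can be observed offline by putting a cookie into the session by hand and preparing the next request. In this sketch the cookie name, value, and example.com URL are placeholders; in real use the cookie would come from a Set-Cookie header of an earlier response.

```python
import requests

session = requests.Session()

# Pretend an earlier response set this cookie (name and value are made up)
session.cookies.set("logged_in", "yes")

# Prepare the next request through the session without sending it
prepared = session.prepare_request(
    requests.Request("GET", "https://example.com/profile")
)

# The session attached its stored cookie automatically
print(prepared.headers.get("Cookie"))
```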

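4. Using a session to keep the login state and log in to GitHub

Approach

    # session

    # headers

    # url1: get the token
        # send the request and get the response
        # extract the token with a regular expression

    # url2: log in
        # build the form data
        # send the login request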
import requests
import re
def login():
    # session
    session = requests.session()

    # headers
    session.headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/95.0.4638.54 Safari/537.36 Edg/95.0.1020.40'
    }

    # url1: get the token
        # send the request and get the response
        # extract the token with a regular expression

    url1 = 'https://github.com/login'
    res_1 = session.get(url1).content.decode()

    # Extract the token from the login form
    token = re.findall(r'name="authenticity_token" value="(.*?)" />', res_1)[0]
    print(token)

    # url2: log in
        # build the form data
        # send the login request

    url2 = 'https://github.com/session'
    data = {
        'commit': 'Sign in',
        'authenticity_token':token,
        'login': 'M1kaelson',
        'password': 'xxxxxxx',
        # 'trusted_device':''
        'webauthn-support': 'supported',
        # 'webauthn-iuvpaa-support':'unsupported'
        # 'return_to':'https://github.com/signup?ref_cta=Sign+up&ref_loc=header+logged+out&ref_page=%2F&source=header-home'
        'allow_signup': '',
        # 'client_id':''
        # 'integration':''
        # 'required_field_771d':''
        # 'timestamp':'1636085932515'
        # 'timestamp_secret':'262637e02cba4e7372cde23e07019cee2b70afd16525d44e160b38b60bb4fc8f'

    }
    print(data)
    session.post(url2, data=data)

    # url3: verify the login by fetching the profile page
    url3 = 'https://github.com/M1kaelson'
    response = session.get(url3)
    with open('github.html','wb')as f:
        f.write(response.content)

if __name__ == '__main__':
    login()
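
The token extraction step is the most fragile part, because re.findall(...)[0] raises IndexError as soon as the page layout changes. The following is a slightly more defensive sketch that can be tried offline; extract_token is a helper name invented here, and the HTML snippet is a hand-written stand-in for the login page, not a copy of GitHub's markup.

```python
import re

# Same idea as the pattern above, without requiring the exact ' />' ending
TOKEN_PATTERN = re.compile(r'name="authenticity_token" value="(.*?)"')


def extract_token(html):
    """Return the hidden form token, or None if it is not in the page."""
    match = TOKEN_PATTERN.search(html)
    return match.group(1) if match else None


sample_html = '<input type="hidden" name="authenticity_token" value="abc123" />'
print(extract_token(sample_html))          # abc123
print(extract_token("<html></html>"))      # None
```

Checking for None before posting the login form makes a layout change show up as a clear error instead of a failed login.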


Reposted from blog.csdn.net/grb819/article/details/121088581