[Xiao Mu learns Python] Web crawling with requests

1. Introduction

Requests HTTP Library
Requests is an HTTP library, written in Python, for human beings.

requests is a widely used HTTP client library that makes it easy to send HTTP requests to websites and read the response results. Its API is considerably more concise than that of the standard-library urllib module.

Requests is built on top of urllib. It is written in Python and released under the Apache2 License (an open-source license). Because it is more convenient and faster to work with than urllib, Requests is the library most often used when writing crawlers.

GitHub: https://github.com/psf/requests
PyPI: https://pypi.org/project/requests/
Official documentation: https://docs.python-requests.org/en/latest/
Chinese documentation: https://docs.python-requests.org/zh_CN/latest/user/quickstart.html

  • (1) Install the requests library:
pip install requests
# or
python -m pip install requests
  • (2) Import the requests module
    To use requests to send HTTP requests, you need to import the requests module first:
import requests

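To confirm the install, you can print the library version (a quick sanity check; the exact number depends on your environment):

import requests
print(requests.__version__)  # e.g. '2.31.0', depending on the installed version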

2. The requests method

The methods provided by the requests module are listed below:

Method                       Description
delete(url, args)            Sends a DELETE request to the specified url
get(url, params, args)       Sends a GET request to the specified url
head(url, args)              Sends a HEAD request to the specified url
patch(url, data, args)       Sends a PATCH request to the specified url
post(url, data, json, args)  Sends a POST request to the specified url
put(url, data, args)         Sends a PUT request to the specified url
request(method, url, args)   Sends a request with the given method to the specified url

2.1 get()

res = requests.get(url, params=params, headers=headers, timeout=timeout)

The parameters are described below:
url: the URL address to fetch.
headers: the request headers to send.
params: query-string parameters carried with the request.
timeout: timeout in seconds; an exception is raised when it is exceeded.
  • Example 1:
import requests
r = requests.get('https://www.python.org')
r.status_code
  • Example 2:
import requests

kw = {'s': 'python tutorial'}

# set the request headers
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.99 Safari/537.36"
}
 
# params accepts a dict or a string of query parameters; a dict is URL-encoded automatically, so urlencode() is not needed
response = requests.get("https://www.xxx.com/", params=kw, headers=headers)

# check the response status code
print(response.status_code)

# check the character encoding of the response
print(response.encoding)

# check the full url address
print(response.url)

# check the response body; response.text returns Unicode data
print(response.text)

2.2 post()

  • Example 1:
import requests

payload = dict(key1='value1', key2='value2')
r = requests.post('https://httpbin.org/post', data=payload)
print(r.text)
  • Example 2:
import requests

# form parameters, named name and age
myobj = {'name': 'tomcat', 'age': '18'}

# send the request
x = requests.post('https://www.xxx.com/demo_post2.php', data=myobj)

# print the page content returned
print(x.text)

2.3 request()

  • json: (optional) If you want to pass JSON data, you can pass the json parameter directly:
payload = {'key': 'value'}
requests.request(method="post", url="", json=payload)  # serialized to JSON internally
  • cookies: (optional) dict
cs = {'token': '12345', 'status': 'working'}
requests.request(method="get", url="", cookies=cs)
  • files: (optional) Uploading files requires a more complex encoding format, but requests reduces it all to the files parameter.
    Be sure to open files with 'rb' (binary mode) so that the byte length read matches the actual file length.
upload_files = {'file': open('report.xls', 'rb')}
requests.request(method="post", url="", files=upload_files)

3. requests response information

The main properties and methods of the requests Response object are listed below:

Property or method       Description
apparent_encoding        The encoding guessed from the response content
close()                  Closes the connection to the server
content                  Returns the response body as bytes
cookies                  Returns a CookieJar object with the cookies sent back from the server
elapsed                  Returns a timedelta with the time elapsed between sending the request and the arrival of the response; useful for measuring response speed, e.g. r.elapsed.microseconds
encoding                 The encoding used to decode r.text
headers                  Returns the response headers as a dictionary-like object
history                  Returns a list of the Response objects in the request history
is_permanent_redirect    Returns True if the response is a permanent redirect, otherwise False
is_redirect              Returns True if the response was a redirect, otherwise False
iter_content()           Iterates over the response body in chunks
iter_lines()             Iterates over the response body line by line
json()                   Returns the body parsed as JSON (raises an error if the body is not valid JSON)
links                    Returns the parsed Link headers of the response
next                     Returns a PreparedRequest object for the next request in a redirection chain
ok                       Checks status_code; returns True if it is less than 400, otherwise False
raise_for_status()       Raises an HTTPError if the request returned an error status
reason                   A text description of the response status, such as "Not Found" or "OK"
request                  Returns the request object that produced this response
status_code              Returns the HTTP status code, such as 200 (OK) or 404 (Not Found)
text                     Returns the response body as Unicode text
url                      Returns the URL of the response
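
A quick sketch exercising several of these properties and methods against the httpbin.org test service:

import requests

r = requests.get('https://httpbin.org/get')
print(r.status_code, r.reason)    # e.g. 200 OK
print(r.ok)                       # True for status codes below 400
print(r.encoding)                 # encoding used to decode r.text
print(r.elapsed.total_seconds())  # seconds between request and response
print(r.json()['url'])            # the body parsed as JSON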

3.1 request.args

(Sections 3.1 through 3.6 below look at the server side: how a Flask application reads the data that requests sends. request here is Flask's request object.)

Given url = http://192.168.56.1:9933/good?id=1&name=abc

request.args holds all of the parameters after the ? in the request URL.
Flask first gathers the parameters as a list of tuples, each of the form ('id', '1'),
then converts them (with decoding) into a dict-like object.

To read one specific parameter:
    id = request.args.get('id')
# url = 127.0.0.1:5000/?name=hua

from flask import Flask, jsonify
from flask import request
 
app = Flask(__name__)
 
 
@app.route('/', methods=['GET', 'POST'])
def hello_world():
    print('Request method ------->', request.method)
    args = request.args.get("name") or "no args parameter"
    print('args parameter is ----->', args)
    form = request.form.get('name') or 'no form parameter'
    print('form parameter is ----->', form)
    return jsonify(args=args, form=form)
 
if __name__ == '__main__':
    app.run(debug=True)

3.2 request.values

request.args and request.form are both contained in request.values;
the two collections are combined into a single dict-like object.

When request.args and request.form both carry the same key, the value you get is
the one from request.args, because args comes first when the two are combined.

To read one specific parameter directly:
id = request.values.get('id')
# values is the combination of the args and form fields
@app.route("/hello", methods=["GET", "POST"])
def hello():
    print("content_type:", request.headers.get("content_type"))
    print("args:", request.args)
    print("form:", request.form)
    print("values:", request.values)
    return "hello"

3.3 request.json

要想获取request.json中的数据
    1. 请求参数属性得有传值过来
    2. 在Headers中必须设置 Content-Type:application/json
    
设置了Content-Type:application/json的Body数据只能通过request.json获取

request.json是把request.data的数据转换成JSON格式的数据
request.data的数据来源是请求参数属性Body
# 将content-type指定为application/json, flask就会将接收到的请求体数据做一次json编码转换,将字符串转换为字典对象,赋值给属性json
@app.route("/hello", methods=["GET", "POST"])
def hello():
    print("content_type:", request.headers.get("content_type"))
    print("data:", request.data)
    print("form:", request.form)
    print("json:", request.json)
    return ""
# If the client sends a JSON-formatted string but the request header does not specify content-type: application/json, calling request.json directly raises an error (Flask answers with 400 Bad Request; newer Flask versions use 415); get_json(force=True) skips the content-type check
@app.route("/hello", methods=["GET", "POST"])
def hello():
    print("content_type:", request.headers.get("content_type"))
    print("get_json:", request.get_json(force=True))
    return "hello"

3.4 request.form

原理跟request.args差不多,只是request.form的数据来源是form表单,其他操作基本一致
#form 顾名思义是表单数据,当请求头content-type 是 application/x-www-form-urlencoded 或者是 multipart/form-data 时,请求体的数据才会被解析为form属性。
#application/x-www-form-urlencoded 是浏览器的form表单默认使用的content-type。

<form action="http://localhost:8000/demo" method="post">
  <input type="text" name="username">
  <input type="password" name="password">
  <button>Log in</button>
</form>

# the server receives the data
@app.route("/hello", methods=["GET", "POST"])
def hello():
    print("content_type:", request.headers.get("content_type"))
    print('form:', request.form)
    print('data:', request.data)
    return "hello"

3.5 request.data

request.data returns bytes.
Its source is the request Body.
When the request does not set Content-Type: application/json, the data in the Body lands in request.data.
# data only has a value when the content-type is neither multipart/form-data nor application/x-www-form-urlencoded;
# for example, with Postman set to content-type text/plain

@app.route("/hello", methods=["GET", "POST"])
def hello():
    print("content_type:", request.headers.get("content_type"))
    print("data:", request.data)
    print("form:", request.form)
    print("files:", request.files)
    return ""

3.6 request.files

request.files receives the data sent by <input type="file" name="file"/> in a form.

Get one uploaded file object:
file = request.files.get('file')
Save it to disk:
file.save(path)

# get a safe version of the uploaded file's name
from werkzeug.utils import secure_filename
file_name = secure_filename(file.filename)
# client
# when the browser uploads a file, the form must set enctype to multipart/form-data
<form action="http://localhost:8000/demo" method="post" enctype="multipart/form-data">
  <input type="text" name="myTextField">
  <input type="checkbox" name="myCheckBox">Check</input>
  <input type="file" name="myFile">
  <button>Send the file</button>
</form>

# server
@app.route("/hello", methods=["GET", "POST"])
def hello():
    print(request.headers.get("content_type"))
    print("files:", request.files)
    # fetch the uploaded file by its form field name and save it
    file = request.files.get("myFile")
    if file:
        file.save(secure_filename(file.filename))
    return ""

4. The get method of requests

Some parameters of the get request are listed below:

requests.get(url, params=None, **kwargs)
url: the URL of the page to fetch
params: extra parameters in the URL, as a dict or byte stream
**kwargs: 12 optional keyword arguments that control access

# The keyword arguments:
json: data in JSON format, sent as the body of the Request
data: a dict, byte sequence, or file object, sent as the body of the Request
headers: dict, custom HTTP headers
cookies: dict or CookieJar, the cookies for the Request
auth: tuple, enables HTTP authentication
files: dict, for transferring files
timeout: the timeout, in seconds
proxies: dict, sets proxy servers to route the request through; can also carry login credentials
allow_redirects: True/False, defaults to True; toggles following redirects
stream: True/False, defaults to False; when False the body is downloaded immediately, when True it is streamed on demand
verify: True/False, defaults to True; toggles SSL certificate verification
cert: path to a local SSL client certificate

4.1 url

import requests
url = "http://www.baidu.com"
resp = requests.get(url)  # send a GET request to the server behind url and obtain the response

4.2 headers

import requests
url=r"https://www.baidu.com/s"
Headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
response = requests.get(url=url, headers=Headers)

4.3 params

import requests
url = r"https://www.baidu.com/s"
# to send a GET request with parameters, e.g. searching Baidu for Python, just:
Params = {"wd": "Python"}
Headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36"
}
response = requests.get(url=url, params=Params, headers=Headers)
print(response.request.url)  # output: https://www.baidu.com/s?wd=Python

4.4 proxies

proxies: (optional) sets a proxy through which both http and https requests are routed

# Acting as an intermediary: the server is accessed from the proxy's IP, which can mask the local machine's IP.
import requests
# proxies routes the request through a proxy IP
# below, Baidu is visited through a proxy
Headers = {
    "User-Agent": "Mozilla/5.0 (Linux; Android 5.0; SM-G900P Build/LRX21T) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Mobile Safari/537.36"
}
# proxies is a dict mapping scheme to proxy URL: scheme + host + port
proxies = {
    "http": "http://1.192.242.107:9999"
    # "https": "https://192.168.0.1:80"
}
url = "https://www.baidu.com"
resp = requests.get(url, headers=Headers, proxies=proxies)
print(resp.content.decode())
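
If the proxy requires authentication, the credentials go into the proxy URL itself (a sketch with placeholder host and credentials):

import requests

# placeholder credentials and host; the format is scheme://user:password@host:port
proxies = {
    "http": "http://user:password@10.10.1.10:3128",
    "https": "http://user:password@10.10.1.10:3128",
}
resp = requests.get("http://www.baidu.com", proxies=proxies, timeout=5)
print(resp.status_code)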

4.5 verify

verify: (optional) Boolean or string, defaults to True. When True, requests verifies the server's SSL certificate; a path to a CA bundle can also be passed to verify against. Set it to False to skip verification.
cert: (optional) string or tuple. string: path to an SSL client certificate file (.pem). tuple: a ("certificate", "key") pair.

# whether to skip SSL certificate verification, for pages that raise certificate errors
'''
When visiting an https page produces a certificate error, verification can be disabled
by setting the verify parameter of the get or post request to False:
requests.get(url, headers=headers, params=params, verify=False)
'''
import requests
url="https://www.12306.cn"
resp=requests.get(url,verify=False)
print(resp.content.decode())
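
Note that with verify=False, urllib3 emits an InsecureRequestWarning on every request. It can be silenced (a sketch, for test environments only):

import requests
import urllib3

# suppress the InsecureRequestWarning that verify=False triggers (test use only)
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

resp = requests.get("https://www.12306.cn", verify=False)
print(resp.status_code)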

4.6 timeout

timeout: (optional) access timeout: a float (seconds to wait for the server to send data) or a tuple (connect timeout, read timeout)

'''
Adding the timeout parameter guarantees that the response returns within the given
number of seconds, or an exception is raised. Usage of the timeout parameter:
response = requests.get(url, timeout=3)  # respond within 3 seconds or raise
'''
import requests
proxies = {"http": "http://1.192.242.107:9999"}
url = "http://www.baidu.com"
try:
    resp = requests.get(url, proxies=proxies, timeout=3)
except requests.exceptions.RequestException:
    print("error while running the request")

4.7 cookies

  • Get Cookies:
import requests
r = requests.get('https://www.baidu.com')
# print the Cookies object
print(r.cookies)
# iterate over the cookies
for key, value in r.cookies.items():
    print(key + '=' + value)
  • Use cookies to maintain a login session:
import requests
headers = {
    'Cookie': 'xxxxx',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0',
    'Host': 'www.zhihu.com'
}
r = requests.get('https://www.zhihu.com/yyyy', headers=headers)
print(r.text)
  • Set cookies through the cookies parameter:
import requests

cookies = 'xxxx'
jar = requests.cookies.RequestsCookieJar()
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:95.0) Gecko/20100101 Firefox/95.0',
    'Host': 'www.zhihu.com'
}
for cookie in cookies.split(';'):
    key, value = cookie.split('=', 1)
    jar.set(key, value)
# pass the jar via the cookies parameter (the original omitted it, so the cookies were never sent)
r = requests.get('https://www.zhihu.com/yyyy', headers=headers, cookies=jar)
print(r.text)
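
requests can also carry cookies across requests automatically through a Session object, which is often simpler than filling a jar by hand (a minimal sketch against the httpbin.org test service):

import requests

# a Session persists cookies between requests made on the same instance
s = requests.Session()
s.get('https://httpbin.org/cookies/set/token/12345')  # the server sets a cookie
r = s.get('https://httpbin.org/cookies')              # the cookie is sent back automatically
print(r.text)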

4.8 auth

auth: (optional) an auth tuple or object to enable Basic/Digest/Custom HTTP authentication

  • HTTPBasicAuth
import requests
from requests.auth import HTTPBasicAuth
r = requests.get('http://localhost:8080/admin', auth=HTTPBasicAuth('admin', '123456'))
print(r.status_code)
  • OAuth
# pip3 install requests_oauthlib
import requests
from requests_oauthlib import OAuth1
url = 'http://localhost:8080/admin'
auth = OAuth1("YOUR_APP_KEY","YOUR_APP_SECRET","USER_OAUTH_TOKEN","USER_OAUTH_TOKEN_SECRET")
requests.get(url,auth=auth)
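
As a shorthand, requests also accepts a plain (user, password) tuple for basic auth. A quick sketch against httpbin.org's test endpoint, which expects the credentials embedded in its URL:

import requests

# httpbin's /basic-auth/{user}/{passwd} endpoint accepts exactly those credentials
r = requests.get('https://httpbin.org/basic-auth/admin/123456', auth=('admin', '123456'))
print(r.status_code)  # 200 when the credentials match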

5. Test code

5.1 Get the HTML of a web page (get)


# -*- coding: UTF-8 -*-
import requests

def get_html(url):
    response = requests.get(url=url)
    result = response.text
    return result

if __name__ == '__main__':
    url = "http://www.baidu.com"
    html = get_html(url)
    print(html)

5.2 Get the HTML of a web page (get with headers)

# -*- coding: UTF-8 -*-
import requests

def get_html(url, headers=None):
    # pass the headers on; the original ignored this parameter
    response = requests.get(url=url, headers=headers)
    return response.text

if __name__ == '__main__':
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"
    }
    url = "http://www.baidu.com"
    html = get_html(url, headers)
    print(html)

5.3 Get the HTML of a web page (post with headers)


# -*- coding: UTF-8 -*-
import requests

def get_response(url, data, headers=None):
    # pass data and headers as keyword arguments; the third positional
    # parameter of requests.post is json, not headers
    response = requests.post(url, data=data, headers=headers)
    result = response.text
    return result

if __name__ == '__main__':
    headers = {
        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.146 Safari/537.36"
    }
    data = {
        "key1": "value1",
        "key2": "value2"
    }
    url = "http://httpbin.org/post"
    html = get_response(url, data, headers)
    print(html)

5.4 Simulated login (get with headers)

The following example uses cookies to simulate a logged-in request to a page:

# -*- coding: UTF-8 -*-
import requests

if __name__ == "__main__":
    # 登录后才能访问的网页
    url = 'http://www.csdn.net'

    # 浏览器登录后得到的cookie
    cookie_str = r'xxx=yyy;zzz=mmm'

    # 把cookie字符串处理成字典,以便接下来使用
    # TODO(You): 请正确准备cookie数据
	cookies = {
    
    }
	for line in cookie_str.split(';'):
	    key, value = line.split('=', 1)
	    cookies[key] = value
	    
    # 设置请求头
    headers = {
    
    
        'User-agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36'
    }

    # send the get request with the headers and cookies attached
    resp = requests.get(
        url, 
        headers=headers, 
        cookies=cookies
    )

    print(resp.content.decode('utf-8'))

5.5 Download images from the web (get with headers)

  • Example 1:
import requests
url = 'https://ss2.bdstatic.com/70cFvnSh_Q1YnxGkpoWK1HF6hhy/it/u=38785274,1357847304&fm=26&gp=0.jpg'
# minimal browser UA string
headers = {'User-Agent': 'Mozilla/4.0'}
# reading an image requires the content attribute (bytes)
html = requests.get(url=url, headers=headers).content
# write the image to disk in binary mode
with open('C:/image/python_logo.jpg', 'wb') as f:
    f.write(html)
  • Example 2:
# -*- coding:utf8 -*-
import requests
import re
from urllib import parse
import os

class BaiduImageSpider(object):
    def __init__(self):
        self.url = 'https://image.baidu.com/search/flip?tn=baiduimage&word={}'
        self.headers = {'User-Agent': 'Mozilla/4.0'}
    # fetch the image links
    def get_image(self, url, word):
        # get a response object with the requests module
        res = requests.get(url, headers=self.headers)
        # change the encoding
        res.encoding = "utf-8"
        # the html of the page
        html = res.text
        print(html)
        # parse with a regular expression
        pattern = re.compile('"hoverURL":"(.*?)"', re.S)
        img_link_list = pattern.findall(html)
        # the image url links found
        print(img_link_list)
        # create a directory to save the images
        directory = 'C:/image/{}/'.format(word)
        # create the directory if it does not exist; a common idiom
        if not os.path.exists(directory):
            os.makedirs(directory)

        # add a counter
        i = 1
        for img_link in img_link_list:
            filename = '{}{}_{}.jpg'.format(directory, word, i)
            self.save_image(img_link, filename)
            i += 1
    # download one image
    def save_image(self, img_link, filename):
        html = requests.get(url=img_link, headers=self.headers).content
        with open(filename, 'wb') as f:
            f.write(html)
        print(filename, 'downloaded successfully')

    # entry point
    def run(self):
        word = input("Enter a keyword for the images: ")
        word_parse = parse.quote(word)
        url = self.url.format(word_parse)
        self.get_image(url, word)

if __name__ == '__main__':
    spider = BaiduImageSpider()
    spider.run()

Epilogue

If you found this method or code even a little useful, give the author a like, or buy them a coffee; ╮( ̄▽ ̄)╭
If the method or code didn't work for you //(ㄒoㄒ)//, leave a comment and the author will keep improving it; o_O???
If you need custom development of related features, you can message the author privately; (✿◡‿◡)
Thanks to everyone for the support! (´▽´)ノ (´▽´)!!!

Origin: blog.csdn.net/hhy321/article/details/129807827