Three: Crawler network request module (Part 2)

1. The Requests module:

Requests is written in Python on top of urllib and released under the Apache2 License. As an HTTP library it is more convenient than urllib, saves us a lot of work, and fully meets the needs of HTTP testing.

Requests' philosophy was developed around the idioms of PEP 20 (The Zen of Python), so it is more concise than urllib.

(1) Installing the Requests module:

Requests is a third-party Python library used specifically for sending HTTP requests.

Installation method:

# 1. Enter in the terminal:
pip install requests

# 2. Install from a mirror: if the download times out or is too slow, just switch the source

# Example
pip install requests -i https://pypi.tuna.tsinghua.edu.cn/simple/

# Aliyun: http://mirrors.aliyun.com/pypi/simple/
# Douban: http://pypi.douban.com/simple/
# Tsinghua University: https://pypi.tuna.tsinghua.edu.cn/simple/
# University of Science and Technology of China: http://pypi.mirrors.ustc.edu.cn/simple/
# Huazhong University of Science and Technology: http://pypi.hustunique.com/

(2) Requests usage:

1. Commonly used methods:

requests.get("url")

requests.post("url")
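
As a minimal sketch of both methods (httpbin.org is used here only as an assumed public test target):

# Minimal sketch: one get request and one post request
import requests

response = requests.get("https://httpbin.org/get")    # send a get request
print(response.status_code)                           # e.g. 200 on success

response = requests.post("https://httpbin.org/post")  # send a post request
print(response.status_code)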

2. Commonly used parameters:

url: the url address, i.e. the request address given in the interface document

params: the request parameters carried in the link; in an ordinary get request, the request parameters are all in the url address

data: the request data, passed in form format

json: the request data, passed in the JSON format commonly used by interfaces

headers: the request header information

cookies: saved user login information; for example, a recharge feature requires the user to be logged in, so the cookie information is sent along with the request
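
A hedged sketch of how these parameters combine in a single call (httpbin.org and every field value below are illustrative assumptions, not values from a real interface):

import requests

# All values below are made up for illustration
url = "https://httpbin.org/post"
params = {"page": "1"}                      # appended to the url as ?page=1
data = {"username": "demo"}                 # form-format request body
headers = {"User-Agent": "Mozilla/5.0"}     # request header information
cookies = {"session": "abc123"}             # saved login information

# Use json={...} instead of data={...} when the interface expects a JSON body
response = requests.post(url, params=params, data=data,
                         headers=headers, cookies=cookies)
print(response.status_code)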

# A recommended crawler tool: the spider tools library
# https://spidertools.cn/#/curl2Request

3. Response content:

r.encoding: get the current encoding

r.encoding = 'charset': set the encoding format, most often 'utf-8'

r.text: the response body as a string; it is decoded automatically using the charset given in the response headers

r.cookies: the response cookies

r.headers: the response headers

r.status_code: the response status code

r.json(): Requests' built-in JSON decoder, returning the body as a json object; the premise is that the returned content must be in JSON format, otherwise a parsing error raises an exception

r.content: the response body as a byte stream (binary form); gzip and deflate compression are decoded for you automatically
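
A small sketch exercising these attributes together (httpbin.org is an assumed target that happens to return JSON):

import requests

r = requests.get("https://httpbin.org/get")

print(r.status_code)     # response status code, e.g. 200
print(r.encoding)        # encoding guessed from the response headers
r.encoding = 'utf-8'     # override the encoding before reading r.text
print(r.text[:100])      # body decoded to a string with the current encoding
print(r.content[:100])   # raw body as bytes
print(r.headers)         # response headers
print(r.cookies)         # cookies returned by the server
print(r.json())          # body parsed as JSON; raises an error if not valid JSON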

4. Applying get-request parameters in Requests: the Sogou "One Piece" search example:

(1) Add the parameters to the url link: the plain style (the most commonly used way)

# Import the module
import requests

# Target url
url = 'https://www.sogou.com/web?query=%E6%B5%B7%E8%B4%BC%E7%8E%8B&_asf=www.sogou.com&_ast=&w=01015002&p=40040108&ie=utf8&from=index-nologin&s_from=index&oq=&ri=0&sourceid=sugg&suguuid=&sut=0&sst0=1678439364372&lkt=0%2C0%2C0&sugsuv=1666847250696567&sugtime=1678439364372'

# Request header information
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/98.0.4758.102 Safari/537.36",
    'Cookie': 'ABTEST=8|1701776265|v17; SNUID=7B0346FA646568A5CDB26F5B64A6691E; IPLOC=CN3205; SUID=186722991B5B8C0A00000000656F0B89; cuid=AAFqAmPwSAAAAAqMFCnulgEASQU=; SUV=1701776266958330; browerV=3; osV=1; sst0=372'
}

# Send the request
response = requests.get(url, headers=headers)

# Specify the encoding format
response.encoding = 'utf-8'

# Get the response
html = response.text

# Write the response data to a file
with open("海贼王1.html", "w", encoding='utf-8') as f:
    f.write(html)

# Print the response (data) content
print(html)

Note: 1. The key-value pairs in the dictionaries need to be separated by commas

(2) Add the parameters to params: the official style (not commonly used)

# Import the module
import requests

# Request header information, as a dictionary
headers = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    # Two ways to set the cookie:
    # (1) 'Cookie': 'ABTEST=8|1701776265|v17; SNUID=7B0346FA646568A5CDB26F5B64A6691E; IPLOC=CN3205; SUID=186722991B5B8C0A00000000656F0B89; cuid=AAFqAmPwSAAAAAqMFCnulgEASQU=; SUV=1701776266958330; browerV=3; osV=1; sst0=372'
}

# Cookie values: (2) as a dictionary
cookies = {
    "ABTEST": "8|1701776265|v17",
    "SNUID": "7B0346FA646568A5CDB26F5B64A6691E",
    "IPLOC": "CN3205",
    "SUID": "186722991B5B8C0A00000000656F0B89",
    "cuid": "AAFqAmPwSAAAAAqMFCnulgEASQU=",
    "SUV": "1701776266958330",
    "browerV": "3",
    "osV": "1",
    "sst0": "372"
}

# Target url
url = "https://www.sogou.com/web"

# Parameters carried by the get request
params = {
    "query": "海贼王",
    "_asf": "www.sogou.com",
    "_ast": "",
    "w": "01015002",
    "p": "40040108",
    "ie": "utf8",
    "from": "index-nologin",
    "s_from": "index",
    "oq": "",
    "ri": "0",
    "sourceid": "sugg",
    "suguuid": "",
    "sut": "0",
    "sst0": "1678439364372",
    "lkt": "0,0,0",
    "sugsuv": "1666847250696567",
    "sugtime": "1678439364372"
}
response = requests.get(url, headers=headers, cookies=cookies, params=params)  # cookies (with an s) because the cookie holds more than one key-value pair

print(response.text)
print(response)

# Write the response data to a file
with open("海贼王2.html", "w", encoding='utf-8') as f:
    f.write(response.text)

A quick way to produce the official style:

1. Open the target page url and perform the following operations:

(screenshot omitted)

2. Enter the crawler tool library and perform the following operations:

(screenshot omitted)

3. Get the following code:

import requests

headers = {
    "authority": "www.sogou.com",
    "accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.7",
    "accept-language": "zh-CN,zh;q=0.9",
    "cache-control": "max-age=0",
    "sec-ch-ua": "^\\^Google",
    "sec-ch-ua-mobile": "?0",
    "sec-ch-ua-platform": "^\\^Windows^^",
    "sec-fetch-dest": "document",
    "sec-fetch-mode": "navigate",
    "sec-fetch-site": "none",
    "sec-fetch-user": "?1",
    "upgrade-insecure-requests": "1",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36"
}
cookies = {
    "ABTEST": "8^|1701776265^|v17",
    "SNUID": "7B0346FA646568A5CDB26F5B64A6691E",
    "IPLOC": "CN3205",
    "SUID": "186722991B5B8C0A00000000656F0B89",
    "cuid": "AAFqAmPwSAAAAAqMFCnulgEASQU=",
    "SUV": "1701776266958330",
    "browerV": "3",
    "osV": "1",
    "sst0": "372"
}
url = "https://www.sogou.com/web"
params = {
    "query": "^%^E6^%^B5^%^B7^%^E8^%^B4^%^BC^%^E7^%^8E^%^8B",
    "_asf": "www.sogou.com",
    "_ast": "",
    "w": "01015002",
    "p": "40040108",
    "ie": "utf8",
    "from": "index-nologin",
    "s_from": "index",
    "oq": "",
    "ri": "0",
    "sourceid": "sugg",
    "suguuid": "",
    "sut": "0",
    "sst0": "1678439364372",
    "lkt": "0^%^2C0^%^2C0",
    "sugsuv": "1666847250696567",
    "sugtime": "1678439364372"
}
response = requests.get(url, headers=headers, cookies=cookies, params=params)

print(response.text)
print(response)

Note: (1) The official style is not recommended, because it carries every parameter, and some of those parameters may have anti-crawling checks attached

(2) The data type of params is dictionary data, so it must consist of key-value pairs

5. How to quickly convert the raw parameters in headers into dictionary data (for reference):

# Use regex replacement
1. Select all of the parameter text in headers
2. Press CTRL + R: a find-and-replace box pops up
3. On the first line, enter the regex: (.*):\s(.*)$
4. On the second line, enter the replacement: "$1":"$2",
5. Tick Regex
6. Click Replace All
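
The same substitution can also be scripted; here is a sketch using Python's re module (the raw header text is a made-up example, and note that Python's replacement syntax uses \1 \2 where the editor uses $1 $2):

import re

# Raw header text as copied from the browser developer tools (made-up example)
raw = """accept-language: zh-CN,zh;q=0.9
cache-control: max-age=0"""

# Same pattern as above; re.M makes $ match at the end of every line
converted = re.sub(r'(.*):\s(.*)$', r'"\1": "\2",', raw, flags=re.M)
print(converted)
# "accept-language": "zh-CN,zh;q=0.9",
# "cache-control": "max-age=0",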

Note: 1. Since the crawler tool library already exists, this method is not really needed; treat it as an extension

2. When typing, be careful not to make syntax mistakes, especially the English comma at the end of the replacement on the second line; be sure to add it

Screenshots of the regex replacement steps:

(screenshots omitted)

6. The post request in Requests:

Most of the usage of post is the same as the get request, but you need to add the data parameter.

Usage scenarios of the post method:

1. When the web page requires login

2. When content needs to be sent to the web page

Syntax format:

response = requests.post("url", data=data, headers=headers)
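
A minimal hedged example of this syntax before the full case below (httpbin.org simply echoes back the submitted form, which makes it a handy stand-in target; the field name kw is invented):

import requests

data = {"kw": "hello"}                   # form data to submit
headers = {"User-Agent": "Mozilla/5.0"}

response = requests.post("https://httpbin.org/post", data=data, headers=headers)
print(response.json()["form"])           # httpbin echoes back {'kw': 'hello'}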

360 Translation: an example of English-Chinese translation:

# Import the module
import requests

# Target url
'''
url analysis -- translating English into Chinese:
https://fanyi.so.com/index/search?eng=1&validate=&ignore_trans=0&query=love
https://fanyi.so.com/index/search?eng=1&validate=&ignore_trans=0&query=like
https://fanyi.so.com/index/search?eng=1&validate=&ignore_trans=0&query=enjoy
Comparing the urls of several English translations shows that only the word changes; everything else is identical.
That means we can turn the final word position into a variable, so any word can be translated
without editing the url every time we want to translate a new one.
'''

'''
url analysis -- translating Chinese into English:
https://fanyi.so.com/index/search?eng=0&validate=&ignore_trans=0&query=%E7%88%B1%E6%83%85
https://fanyi.so.com/index/search?eng=0&validate=&ignore_trans=0&query=%E5%96%9C%E6%AC%A2
https://fanyi.so.com/index/search?eng=0&validate=&ignore_trans=0&query=%E4%BA%AB%E5%8F%97
'''

'''
Comparing the two translation directions shows that besides the final word, eng also differs: one is 0, the other is 1.
Two ways to support both directions:
Method 1: an if statement
Method 2: a function
'''

print("Chinese --> English ; enter 0")
print("English --> Chinese ; enter 1")
choose = input("Enter your choice: ")
word = input("Enter the content you want to translate: ")
url = f'https://fanyi.so.com/index/search?eng={choose}&validate=&ignore_trans=0&query={word}'


# Request header information
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/119.0.0.0 Safari/537.36",
    'Pro': 'fanyi'  # inspection shows the anti-crawl check is tied to this parameter, so it must be included
}

# Parameters carried by the post request
data = {
    'eng': f'{choose}',  # note the changed parameter: the eng parameter in the url has been replaced by the choose variable
    'validate': "",
    'ignore_trans': '0',
    'query': f'{word}'  # note the changed parameter: the last parameter in the url has been replaced by the word variable
}

# Send the request
response = requests.post(url, headers=headers, data=data)

'''
# Check for possible errors
print(response.text)  # after testing, no error appears, so data cannot be the problem; the anti-crawl must be somewhere else
'''

'''
# Get the response data
result = response.json()  # raises a JSONDecodeError, meaning the response cannot be converted into a json object
print(result)
'''

# Get the response data (works once the 'Pro' header above is in place)
result = response.json()

# Filter the information
fanyi = result['data']['fanyi']  # dictionary value lookup

'''
Two ways to take a value from a dictionary:
    1. ['key'] -- raises an error when the key cannot be found
    2. xxx.get('key') -- returns None when the key cannot be found

        Note: when a key cannot be found, the first method needs exception handling, while the second does not
'''
print(fanyi)
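
To make the two dictionary lookup styles above concrete, a standalone sketch:

result = {'data': {'fanyi': '爱情'}}

# 1. ['key']: raises KeyError when the key is missing, so wrap it in try/except
try:
    print(result['data']['fanyi'])    # 爱情
    print(result['data']['missing'])  # raises KeyError
except KeyError:
    print("key not found; handle the exception")

# 2. .get('key'): returns None (or a chosen default) when the key is missing
print(result['data'].get('missing'))          # None
print(result['data'].get('missing', 'N/A'))   # N/A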

Screenshots of the information filtering:
(screenshots omitted)

Origin: blog.csdn.net/qiao_yue/article/details/134843450