1. Introduction to the requests module
Function: simulates a browser sending HTTP requests.
Installation: open a local terminal and run the following command:
pip install requests
Import the package:
import requests
Two ways to send requests
GET: explicit submission. The data submitted by GET is visible in the address bar. Typing a URL directly into the browser's address bar sends a GET request.
POST: implicit submission. The data submitted by POST is generally not visible in the address bar. It is typically used for forms (login, registration, passwords).
resp = requests.get(url, params=None, **kwargs)
resp = requests.post(url, data=None, **kwargs)
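To see where the parameters end up in each case, the sketch below builds a GET and a POST request without sending them, using requests.Request(...).prepare(). The two URLs are the ones used in the exercises later in this article; the keyword values are made up for illustration.

```python
import requests

# Build (but do not send) a GET request: params are appended to the URL.
get_req = requests.Request(
    "GET", "https://www.sogou.com/web", params={"query": "python"}
).prepare()
print(get_req.url)   # the query string is visible in the URL
print(get_req.body)  # None: this GET carries no body

# Build (but do not send) a POST request: data is form-encoded into the body.
post_req = requests.Request(
    "POST", "https://fanyi.baidu.com/sug", data={"kw": "dog"}
).prepare()
print(post_req.url)   # the URL is unchanged
print(post_req.body)  # the form data travels in the body
```

Nothing goes over the network here, so this is a safe way to inspect what a request would look like before sending it.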
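'''
Note:
Sometimes the request parameters are carried in the request payload. Such parameters must be sent as JSON, and the request headers will include application/json.
# The request headers must include Content-Type: application/json
'''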
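import json
import requests
headers = {
    'Content-Type': 'application/json'
}
url = 'xxx'
data = {
    'xxx': 'xxx'
}
# Option 1: serialize the dict to a JSON string yourself
resp = requests.post(url, data=json.dumps(data), headers=headers)
# Option 2: pass the dict via json= and let requests serialize it
resp = requests.post(url, json=data, headers=headers)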
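The difference between a form submission and a JSON payload is easier to see when the two encodings are printed side by side. The following is a minimal sketch using only the standard library; the field names and values are placeholder assumptions, not a real API, and the form encoding shown is only roughly what requests produces for data=dict.

```python
import json
from urllib.parse import urlencode

payload = {"kw": "dog", "page": 1}  # hypothetical parameters

# Roughly what a form submission (data=dict) looks like in the body
form_body = urlencode(payload)

# What a request payload (Content-Type: application/json) looks like
json_body = json.dumps(payload)

print(form_body)
print(json_body)

# The JSON body round-trips back to the original dict, types included
restored = json.loads(json_body)
print(restored == payload)
```

Note that the form encoding turns every value into text, while the JSON body keeps the integer as an integer.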
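Status codes (HTTP protocol status codes):
200 series: generally means the current exchange with the server went through without problems.
300 series: generally means a redirect. Note that a Location field appears in the response headers.
When writing a crawler you can mostly ignore 302, because requests follows the redirect for you automatically.
404: the resource is lost. Your URL does not exist, so the server cannot give you the content you want.
403: usually means the request was blocked by the site's risk control.
500 series: an error occurred inside the server.
If everything works in the browser but your program gets a 500 as soon as it runs, the parameters you sent are most likely wrong and prevent the server from working normally.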
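The categories above can be turned into a small helper for logging. This is only an illustrative sketch: describe_status is a hypothetical function name, and in a real script the code would come from resp.status_code.

```python
def describe_status(code: int) -> str:
    """Map an HTTP status code to the rough category described above."""
    if code == 403:
        return "forbidden, often blocked by risk control"
    if code == 404:
        return "not found, the URL does not exist on the server"
    if 200 <= code < 300:
        return "success"
    if 300 <= code < 400:
        return "redirect, check the Location response header"
    if 500 <= code < 600:
        return "server error, double-check the parameters you sent"
    return "other"


for code in (200, 302, 403, 404, 500):
    print(code, describe_status(code))
```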
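When you run into a timeout, you can retry with a loop like the following:
for i in range(10):
    try:
        resp = requests.get(url, timeout=5)  # send the request
        break
    except Exception as e:
        print('Request failed:', e)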
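The retry loop above can also be wrapped in a reusable helper. The sketch below is one possible shape, not a standard API: retry and flaky_fetch are hypothetical names, and flaky_fetch merely simulates a request that times out twice so the example runs without network access.

```python
import time


def retry(func, attempts=10, delay=0.0):
    """Call func() until it succeeds or the attempts run out."""
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception as e:
            last_error = e
            print(f"attempt {attempt} failed: {e}")
            time.sleep(delay)
    raise RuntimeError(f"all {attempts} attempts failed") from last_error


# Stand-in for a flaky request: fails twice, then succeeds.
calls = {"n": 0}


def flaky_fetch():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = retry(flaky_fetch, attempts=5)
print(result)
```

In a real crawler, func could be something like lambda: requests.get(url, timeout=5), with a non-zero delay so the server is not hammered.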
2. requests in practice
1. Search by keyword (Sogou)
import requests
n = input("Enter a keyword: ")
url = f'https://www.sogou.com/web?query={n}'
header = {
    "user-agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
}
resp = requests.get(url, headers=header)
print(resp.text)
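Keywords that contain spaces or non-ASCII characters need to be percent-encoded before they go into a URL. Passing the keyword through params lets requests handle that. The sketch below shows the encoding itself with the standard library; build_search_url is a hypothetical helper name.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def build_search_url(keyword: str) -> str:
    """Build a Sogou search URL with a safely encoded keyword."""
    return "https://www.sogou.com/web?" + urlencode({"query": keyword})


url = build_search_url("python requests")
print(url)

# Decoding the query string recovers the original keyword
decoded = parse_qs(urlsplit(url).query)
print(decoded)
```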
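2. Baidu Translate (POST exercise)
'''
Scrape Baidu Translate
Requirements
1. Open Baidu Translate at https://fanyi.baidu.com
2. Press F12 and open the XHR tab to capture the dynamically loaded requests
3. Type a word into the translation box
4. Inspect the captured packets and pick out the data you want
5. Fetch the data
'''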
import json
import requests
url = "https://fanyi.baidu.com/sug"
key = input('Enter a word: ')
data = {
    "kw": key
}
resp = requests.post(url, data=data)
# Option 1: parse the response text with the json module
dic = json.loads(resp.text)
for i in dic['data']:
    name = i['k']
    mean = i['v']
    print(f'{name} | {mean}')
print('=' * 30)
# Option 2: let requests parse the JSON body
dic1 = resp.json()
for i in dic1['data']:
    name = i['k']
    mean = i['v']
    print(f'{name} | {mean}')
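The parsing step can be tried without calling the site at all. In the sketch below, only the shape of the response (a 'data' list of items with 'k' and 'v' keys) is taken from the code above; the sample values and the format_suggestions name are invented for illustration.

```python
import json

# Hypothetical response text: the shape mirrors the code above,
# the values are invented.
sample_text = (
    '{"data": [{"k": "dog", "v": "n. a domestic animal"},'
    ' {"k": "dogs", "v": "n. plural of dog"}]}'
)


def format_suggestions(text: str) -> list:
    """Turn the JSON text into 'word | meaning' lines."""
    parsed = json.loads(text)
    # .get() avoids a KeyError if the response has no 'data' field
    return [f"{item['k']} | {item['v']}" for item in parsed.get("data", [])]


for line in format_suggestions(sample_text):
    print(line)
```

Falling back to an empty list is a design choice here: an unexpected response then prints nothing instead of crashing the script.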
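3. Douban movie top250 (advanced requests)
'''
Methodology:
For any website, the first step is to check whether the data you want is in the page source.
If it is:
    request the url directly
If it is not:
    use a packet capture tool to find out which url the data is actually loaded from
'''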
import requests
url = 'https://movie.douban.com/j/chart/top_list'
parse = {
    "type": "24",
    "interval_id": "100:90",
    "action": "",
    "start": "0",
    "limit": "20"
}
header = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
}
resp = requests.get(url, params=parse, headers=header)
print(resp.json())
# The final URL, including the encoded query string
print(resp.request.url)
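The start and limit parameters suggest that the endpoint is paged. Assuming start is a zero-based offset and limit is the page size, the parameters for any page can be generated by a small helper. page_params is a hypothetical name, and the defaults are simply the values used above.

```python
def page_params(page: int, page_size: int = 20, movie_type: str = "24") -> dict:
    """Query parameters for one page of results (page is zero-based)."""
    return {
        "type": movie_type,
        "interval_id": "100:90",
        "action": "",
        "start": str(page * page_size),  # offset of the first item on the page
        "limit": str(page_size),
    }


for page in range(3):
    print(page_params(page))
```

Each dictionary can then be passed as params= to requests.get().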
4. Image download (the same approach works for exe, rar, zip, etc.)
import requests
url = 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2797313535.jpg'
resp = requests.get(url)
# resp.content holds the raw bytes of the response
content = resp.content
with open('hhh.jpg', 'wb') as f:
    f.write(content)
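Instead of hard-coding the file name, it can be derived from the URL. The sketch below uses hypothetical helpers (filename_from_url and save_bytes) and a few fake bytes in place of resp.content so that it runs without network access.

```python
import os
import posixpath
import tempfile
from urllib.parse import urlsplit


def filename_from_url(url: str, default: str = "download.bin") -> str:
    """Use the last path segment of the URL as the local file name."""
    name = posixpath.basename(urlsplit(url).path)
    return name or default


def save_bytes(content: bytes, path: str) -> int:
    """Write raw bytes to path and return the number of bytes written."""
    with open(path, "wb") as f:
        return f.write(content)


url = "https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2797313535.jpg"
name = filename_from_url(url)
print(name)

# Stand-in for resp.content
fake_content = b"\xff\xd8\xff\xe0 not a real image"
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, name)
    written = save_bytes(fake_content, target)
    size_on_disk = os.path.getsize(target)
print(written, size_on_disk)
```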
5. Saving Douban movie data
'''
Requirements
https://movie.douban.com/typerank?type_name=%E7%88%B1%E6%83%85&type=13&interval_id=100:90&action=
Try to scrape the `title`, `score`, and `cover image url` of the top `100` movies
'''
import requests
with open('movie_ranking.txt', 'w', encoding='utf-8') as f:
    for num in range(5):
        start = num * 20
        url = 'https://movie.douban.com/j/chart/top_list'
        params = {
            "type": "13",
            "interval_id": "100:90",
            "action": "",
            "start": start,
            "limit": "20"
        }
        header = {
            "Cookie": 'll="118173"; bid=_Wl9BaS0eKI; gr_user_id=909d0c5b-e83f-4391-80e5-5259e3c545d6; ap_v=0,6.0',
            "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36"
        }
        resp = requests.get(url, params=params, headers=header)
        dic = resp.json()
        for i in dic:
            rank = i['rank']
            title = i['title']
            score = i['score']
            img = i['cover_url']
            f.write(f"{rank}|{title}|{score}|{img}\n")
        # num is the zero-based page index
        print(f'Page {num + 1} downloaded!')
print("Over!!!!")
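Joining fields with "|" by hand breaks if a title itself contains that character. One alternative, sketched here under assumptions, is to let the csv module do the delimiting and quoting. The records below are invented; only the key names come from the loop above.

```python
import csv
import io

# Hypothetical records shaped like the items read in the loop above
movies = [
    {"rank": 1, "title": "Movie A", "score": "9.5",
     "cover_url": "https://example.com/a.jpg"},
    {"rank": 2, "title": "Movie B | Extended", "score": "9.1",
     "cover_url": "https://example.com/b.jpg"},
]


def write_movies(records, fileobj):
    """Write rank, title, score and cover URL as '|'-delimited rows."""
    writer = csv.writer(fileobj, delimiter="|", lineterminator="\n")
    writer.writerow(["rank", "title", "score", "cover_url"])
    for m in records:
        writer.writerow([m["rank"], m["title"], m["score"], m["cover_url"]])


buffer = io.StringIO()
write_movies(movies, buffer)
print(buffer.getvalue())

# Reading the rows back recovers the title that contains the delimiter
rows = list(csv.reader(io.StringIO(buffer.getvalue()), delimiter="|"))
print(rows[2][1])
```

To write to disk instead, pass a file opened with open(path, 'w', encoding='utf-8', newline='') in place of the StringIO buffer.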
3. Summary of requests
requests.get()
Sends a GET request. The request parameters can be appended directly to the url after the ?, or placed in a dictionary and passed to params.
requests.post()
Sends a POST request. The request parameters should be placed in a dictionary and passed to data.
Request Payload
The parameters in data need to be converted to JSON.
resp.text
Receives text. In essence it is the result of calling decode() on resp.content. Type: str
resp.json()
Receives the JSON string in the response and parses it into a dictionary. Type: dict|list
resp.content
Receives bytes. Type: bytes
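The relationship between the three response attributes can be reproduced with plain Python objects. In this sketch a hand-written byte string stands in for resp.content, and UTF-8 is assumed as the encoding (requests itself works out the encoding from the response).

```python
import json

# Stand-in for resp.content: the raw bytes of a JSON response body
content = '{"data": [{"k": "dog", "v": "狗"}]}'.encode("utf-8")

text = content.decode("utf-8")  # roughly what resp.text amounts to
parsed = json.loads(text)       # roughly what resp.json() amounts to

print(type(content).__name__, type(text).__name__, type(parsed).__name__)
print(parsed["data"][0]["v"])
```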