Python's main modules for making network requests are:
- urllib module
- requests module √ efficient and concise
requests module:
- Powerful, simple, convenient, and efficient for network requests
- Role: simulates a browser sending requests
Using requests breaks down into four steps:
- Specify the URL
- Send a request to the URL
- Get the response data
- Persist the data to storage
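Before turning to requests, the four steps above can be sketched with the standard-library urllib module mentioned earlier. A data: URL stands in for a real site so the sketch runs offline; substitute a real address such as https://www.sogou.com/ to actually crawl.

```python
import urllib.request

# step 1: specify the URL (a data: URL here, so no network is needed)
url = "data:text/html,<html>hello</html>"

# step 2: send a request to the URL
with urllib.request.urlopen(url) as resp:
    # step 3: get the response data
    page_text = resp.read().decode("utf-8")

# step 4: persist the data to a local file
with open("demo.html", "w", encoding="utf-8") as fp:
    fp.write(page_text)
print(page_text)
```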
Environment installation:
pip install requests
Hands-on code:
- Crawl the Sogou homepage
import requests

url = 'https://www.sogou.com/'  # step 1: specify the URL
if __name__ == "__main__":
    # step 2: get() sends the request and returns a response object
    response = requests.get(url=url)
    # step 3: get the response data
    page_text = response.text
    print(page_text)
    # step 4: persist the data; here we save it as a local HTML file
    with open('./sogou.html', 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print("Finished crawling")
- Sogou webpage collector for a dynamically specified search keyword
#Webpage collector
#Dynamically specify a keyword and crawl the results
#UA spoofing: the site's server inspects the User-Agent of each request; if it identifies a known browser, the request is treated as normal
#otherwise the request is treated as abnormal, so we spoof the User-Agent header
ua = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3775.400 QQBrowser/10.6.4209.400"
import requests

keyword = "硕了"  # could be read from input() instead
if __name__ == "__main__":
    url = "https://www.sogou.com/web"
    # pack the URL query parameters into a dict
    headers = {
        'User-Agent': ua,
    }
    param = {
        'query': keyword,
    }
    # requests appends the params to the URL and handles the encoding
    response = requests.get(url=url, params=param, headers=headers)
    page_text = response.text
    filename = keyword + '.html'
    with open(filename, 'w', encoding='utf-8') as fp:
        fp.write(page_text)
    print(filename, "crawled successfully")
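Under the hood, the params dict is URL-encoded and appended to the URL after a '?'. The standard-library urllib.parse shows roughly the same encoding requests performs, including percent-escaping of non-ASCII keywords:

```python
from urllib.parse import urlencode

param = {'query': '硕了'}
query_string = urlencode(param)  # percent-encodes the UTF-8 bytes
full_url = "https://www.sogou.com/web?" + query_string
print(full_url)
```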
- Crack Baidu Translate (partially obtain the data)
  an Ajax request
#Packet capture shows the request is a POST request and the response is JSON data
#the POST request carries parameters
#we send the request with requests
#use the browser's packet-capture tool; knowing beforehand that the page uses Ajax, we can capture the XHR requests
import json
import requests

ua = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3775.400 QQBrowser/10.6.4209.400"
if __name__ == "__main__":
    post_url = "https://fanyi.baidu.com/sug"
    # POST request parameters
    data = {
        "kw": "dog"
    }
    headers = {
        "User-Agent": ua,
    }
    # send the request
    response = requests.post(url=post_url, data=data, headers=headers)
    # json() returns a Python object; only call it once you have confirmed
    # that the server's response data really is JSON
    dic_obj = response.json()
    print(dic_obj)
    with open('./dog.json', 'w', encoding='utf-8') as fp:
        json.dump(dic_obj, fp=fp, ensure_ascii=False)  # Chinese text cannot be ASCII-encoded
    print("over")
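The ensure_ascii=False argument matters whenever the JSON contains Chinese text: with the default ensure_ascii=True, non-ASCII characters are written as \uXXXX escapes instead of readable characters.

```python
import json

obj = {"kw": "狗"}  # "dog" in Chinese
escaped = json.dumps(obj)                       # default: ensure_ascii=True
readable = json.dumps(obj, ensure_ascii=False)  # keep the characters as-is
print(escaped)   # {"kw": "\u72d7"}
print(readable)  # {"kw": "狗"}
```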
- Crawl the Douban movie rankings by category
import requests

ua = "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3775.400 QQBrowser/10.6.4209.400"
if __name__ == "__main__":
    # base URL only; the query parameters are passed separately via params
    url = "https://movie.douban.com/j/chart/top_list"
    param = {
        'type': '24',
        'interval_id': '100:90',
        'action': '',
        'start': '20',  # offset of the first movie to fetch
        'limit': '20',  # number of movies per response
    }
    headers = {
        "User-Agent": ua,
    }
    response = requests.get(url=url, params=param, headers=headers)
    list_obj = response.json()
    print(list_obj)
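Since start and limit page through the ranking, looping over successive param dicts fetches more than one page. A hypothetical helper (the parameter names and values come from the query above; the function name is made up):

```python
def page_params(pages, limit=20):
    """Yield one Douban ranking param dict per page, advancing 'start'."""
    for i in range(pages):
        yield {
            'type': '24',
            'interval_id': '100:90',
            'action': '',
            'start': str(i * limit),  # offset grows by one page each time
            'limit': str(limit),
        }

starts = [p['start'] for p in page_params(3)]
print(starts)  # ['0', '20', '40']
```

Each yielded dict would be passed as params to requests.get exactly as in the block above.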
- Crawl KFC restaurant information
import requests

kw = "广州"  # search keyword: Guangzhou
if __name__ == "__main__":
    url = "http://www.kfc.com.cn/kfccda/ashx/GetStoreList.ashx?op=keyword"
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.25 Safari/537.36 Core/1.70.3775.400 QQBrowser/10.6.4198.400'
    }
    data = {
        'cname': '',
        'pid': '',
        'keyword': kw,
        'pageIndex': '1',
        'pageSize': '10'
    }
    response = requests.post(url=url, headers=headers, data=data)
    print(response.text)
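The KFC endpoint also answers with JSON text, so it could be parsed like the Baidu example. A defensive sketch with the standard-library json module, using a made-up sample string in place of response.text (the real field names may differ):

```python
import json

sample_text = '{"Table1": []}'  # hypothetical stand-in for response.text
try:
    payload = json.loads(sample_text)
except json.JSONDecodeError:
    payload = None  # the server answered with something other than JSON
print(payload)  # {'Table1': []}
```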