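Environment: Windows 7 + Python 3.6 + PyCharm 2017
Goal: scrape shop information for the Shenzhen area from the mobile version of Meituan's food section (美团美食), including shop name, category, address, phone number, average spend per person, opening hours, rating, number of reviews, and longitude/latitude. The final run collected about 21,000 records in roughly one hour. Tools: requests, selenium, Chrome.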
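1. Meituan desktop site
Open Meituan Shenzhen at https://sz.meituan.com/, click 美食 (Food), and press F12 to open the browser developer tools. Select the Network tab and the XHR filter, then click any sub-area, for example 香蜜湖. A request named getPoiList?cityName=XXXXXXXX shows up, and its URL carries a parameter called _token. This token is presumably computed by some algorithm in JavaScript, so simulating the browser request requires knowing how to generate it. With JS-side encryption there are generally two options: reverse-engineer the algorithm and reimplement it yourself, or call the site's JS code directly. The parameter appears to have been added only in the last few months. I searched online without finding a solution, and reading the JS files myself got me nowhere, so I gave up on the desktop site. If anyone knows how to handle this token, please let me know. If you really need a token, selenium + Chrome should also work, since each token is probably valid for some period of time.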
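2. Meituan mobile site
Since the desktop site was a dead end, I had to find another route. Many sites now have a desktop version, a mobile version, and an app, and the mobile version usually has weaker anti-scraping measures. Open the Meituan mobile site at https://i.meituan.com/, press F12 to open the developer tools, and click the two icons marked 1 in the original screenshot to emulate a mobile browser.
Then click 美食 (Food). Two requests are of interest. The first returns the basic page skeleton, including the category information that is used later. The second, list, is a dynamic request that returns the shop data. It is a POST request (its parameters were highlighted in red in the original screenshot), and clicking around a few times makes the meaning of the parameters clear. Only four of them change: areaId (area), cateId (food category), offset (paging), and uuid (an ID issued by the site).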
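As a minimal sketch of how the four varying parameters fit into the request body: the constant values are copied from the request shown later in this post, while the names `BASE_PAYLOAD` and `build_payload` are hypothetical, and a few constant fields (`originUrl`, the `deal_attr_*` and `poi_attr_*` entries) are left out here for brevity.

```python
import json

# Fields that stayed constant in the captured request (subset, values from the post).
BASE_PAYLOAD = {
    "version": "8.3.3", "platform": 3, "app": "", "partner": 126,
    "riskLevel": 1, "optimusCode": 10, "limit": 15,
    "lineId": 0, "stationId": 0, "sort": "default",
}


def build_payload(uuid, area_id, cate_id=1, offset=0):
    """Return a copy of the base payload with the four varying parameters set."""
    payload = dict(BASE_PAYLOAD)
    payload.update({"uuid": uuid, "areaId": area_id,
                    "cateId": cate_id, "offset": offset})
    return payload


# "your-uuid" is a placeholder; 1056 is the areaId of 香蜜湖 in the area data.
print(json.dumps(build_payload("your-uuid", 1056, offset=15)))
```

Keeping the constant part separate makes it obvious which values a crawl loop actually has to change.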
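Simulate the browser by sending the POST request directly and change offset to page through the results. Each page holds 15 records, so offset increases by 15 per page. In my tests, paging directly under the top-level food page stops after 67 pages (1,005 records); beyond that the site either shows a captcha or returns no data. So the shops have to be scraped by category.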
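The paging arithmetic can be sketched as follows. The page size of 15 and the 67-page limit are the values observed above; the function name `page_offsets` is made up for illustration.

```python
PAGE_SIZE = 15   # records per page, as observed in the post
MAX_PAGES = 67   # pages reachable before the site stops returning data


def page_offsets(total_count, page_size=PAGE_SIZE, max_pages=MAX_PAGES):
    """Return the offset of every page needed to cover total_count records."""
    pages = -(-total_count // page_size)  # ceiling division
    return [page * page_size for page in range(min(pages, max_pages))]


print(page_offsets(40))          # [0, 15, 30]
print(len(page_offsets(46655)))  # 67
```

Any listing with more than 1,005 shops is truncated at the cap, which is why the crawl has to be split into smaller listings.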
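The information we need is on each shop's detail page. A detail-page URL is usually assembled from a few key parameters, and those can be collected from the list page above. Opening a shop and looking at its URL shows two main parameters: the shop ID (for example 6268902) and ct_poi. Both can be found in the data returned by the POST request.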
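A small sketch of assembling the detail-page URL from those two values. The URL pattern is the one used by the main script at the end of this post; `build_detail_url` is a hypothetical helper name and the `ct_poi` value below is only a placeholder.

```python
DETAIL_BASE = "https://meishi.meituan.com/i/poi/"


def build_detail_url(poiid, ct_poi):
    """Join the shop ID and ct_poi into a detail-page URL."""
    return "{}{}?ct_poi={}".format(DETAIL_BASE, poiid, ct_poi)


# The ct_poi value here is a placeholder, not a real token.
print(build_detail_url(6268902, "EXAMPLE_CT_POI"))
```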
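Opening a detail page triggers many requests, so we need to confirm which one returns the shop information we want (name, category, address, phone number, average spend, opening hours, rating, number of reviews, longitude/latitude). It turns out to be the very first request, that is, the detail-page URL itself.
Open the HTML returned by that first request and press Ctrl+F to search for the shop's phone number. It sits inside a <script crossorigin='anonymous'> tag. There are several such tags, so they need to be told apart: when parsing with XPath, take each tag's text and check whether its first 16 characters are window._appState. What remains is ordinary JSON handling.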
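The prefix check can be sketched on plain strings, without any HTML parser. This is an illustration only: `extract_app_state` is a hypothetical name, the sample script bodies are invented, and the sketch locates the JSON by its outermost braces rather than by the fixed slice used in the module later in this post.

```python
import json

PREFIX = "window._appState"


def extract_app_state(script_texts):
    """Return the parsed _appState object from the first matching script body, else None."""
    for text in script_texts:
        if text and text.strip().startswith(PREFIX):
            start = text.index("{")       # first brace opens the JSON object
            end = text.rindex("}") + 1    # last brace closes it
            return json.loads(text[start:end])
    return None


sample = [
    "var x = 1;",
    'window._appState = {"poiInfo": {"name": "demo", "phone": "0755-0000"}};',
]
state = extract_app_state(sample)
print(state["poiInfo"]["phone"])
```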
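3. Basic approach
This gives us the basic approach: first collect each shop's ID and ct_poi from the list pages, build the detail-page URLs, then visit the detail pages to extract the information. Because paging stops at 67 pages, the shops have to be scraped by category. Here I split by area, which should keep every sub-area under 67 pages (1,005 shops). The site shows a citywide total of 46,655 shops, but the per-area counts add up to about 24,000, and the "all categories" view also shows about 24,000, so I believe the real total is about 24,000. The remaining problem is to get the areaId of every area.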
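4. Collecting area IDs
Open the food page: https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1
Looking at the HTML, the ID of every area is again inside a <script crossorigin='anonymous'> tag. The browser does not display the data in full, so download the HTML and open it in a local editor. This is also JSON. The only special case is 南澳新区: it has no sub-areas, so I simply added it to 坪山区.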
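A sketch of turning that JSON into a flat list of sub-areas and checking each one against the 1,005-record paging limit. The sample entries are taken from the 盐田区 data used in this post; `flatten_areas` and `over_cap` are hypothetical helper names. Note that a count can be an empty string in the page data, so it is treated as 0 here.

```python
PAGE_CAP = 67 * 15  # 1,005 records reachable per listing


def flatten_areas(area_obj):
    """Collect every sub-area, skipping the first ("全部") entry of each district."""
    areas = []
    for entries in area_obj.values():
        areas.extend(entries[1:])
    return areas


def over_cap(areas, cap=PAGE_CAP):
    """Return the sub-areas whose shop count exceeds the paging cap."""
    return [a for a in areas if (a["count"] or 0) > cap]


sample = {"31": [
    {"id": 31, "name": "全部", "regionName": "盐田区", "count": 407},
    {"id": 755, "name": "沙头角", "regionName": "沙头角", "count": 118},
    {"id": 38055, "name": "溪涌", "regionName": "溪涌", "count": ""},
]}
areas = flatten_areas(sample)
print([a["id"] for a in areas])  # [755, 38055]
print(over_cap(areas))           # []
```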
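5. Collecting shop IDs and ct_poi
With each area's ID in hand, we can build the POST request directly to fetch the shop list. The request needs a cookie, and a single cookie was enough to finish the whole crawl. The response is JSON with 15 shops per page. Extract each shop's ID and ct_poi and save them to a local CSV file. After the crawl, de-duplicate the records, treating rows with the same shop ID as duplicates. The code also saves the shop category cateName, which does not seem to be available on the detail page. The code is below and should run once the cookie is replaced. After de-duplication there were 21,872 records.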
# coding=utf-8
import csv
import time
import requests
import json


# Scrape shop name, cateName, poiid and ctPoi for one area; areaid is the area ID
def crow_id(areaid):
    url = 'https://meishi.meituan.com/i/api/channel/deal/list'
    head = {'Host': 'meishi.meituan.com',
            'Accept': 'application/json',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Referer': 'https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1',
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Mobile Safari/537.36',
            'Cookie': 'XXXXXXXXXXXXXX'
            }
    p = {'https': 'https://27.157.76.75:4275'}
    data = {"uuid": "09dbb48e-4aed-4683-9ce5-c14b16ae7539", "version": "8.3.3", "platform": 3, "app": "",
            "partner": 126, "riskLevel": 1, "optimusCode": 10,
            "originUrl": "http://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1",
            "offset": 0, "limit": 15, "cateId": 1, "lineId": 0, "stationId": 0, "areaId": areaid, "sort": "default",
            "deal_attr_23": "", "deal_attr_24": "", "deal_attr_25": "", "poi_attr_20043": "", "poi_attr_20033": ""}
    r = requests.post(url, headers=head, data=data, proxies=p)
    result = json.loads(r.text)
    totalcount = result['data']['poiList']['totalCount']  # total shops in this area, used to work out how many pages to fetch
    datas = result['data']['poiList']['poiInfos']
    print(len(datas), totalcount)
    id_list = []
    for d in datas:
        id_list.append([d['name'], d['cateName'], d['poiid'], d['ctPoi']])
    print('Page:1')
    # Save the first page to the local CSV
    with open('meituan_id.csv', 'a', newline='', encoding='gb18030') as f:
        write = csv.writer(f)
        for i in id_list:
            write.writerow(i)
    # Fetch page 2 through the last page; nothing to do if the area fits on one page
    offset = 15
    while offset < totalcount:
        id_list = []
        m = offset // 15 + 1
        print('Page:%d' % m)
        # Same POST parameters as above; only offset changes from page to page
        data2 = dict(data, offset=offset)
        try:
            r = requests.post(url, headers=head, data=data2, proxies=p)
            print(r.text)
            result = json.loads(r.text)
            datas = result['data']['poiList']['poiInfos']
            print(len(datas))
            for d in datas:
                id_list.append([d['name'], d['cateName'], d['poiid'], d['ctPoi']])
            # Save this page to the local CSV
            with open('meituan_id.csv', 'a', newline='', encoding='gb18030') as f:
                write = csv.writer(f)
                for i in id_list:
                    write.writerow(i)
        except Exception as e:
            print(e)
        offset += 15
if __name__ == '__main__':
    # Area data copied straight from the page HTML; 南澳新区 needs special handling because it has no sub-areas
    a = {"areaObj": {"28": [{"id": 28, "name": "全部", "regionName": "福田区", "count": 4022},
{"id": 1056, "name": "香蜜湖", "regionName": "香蜜湖", "count": 105},
{"id": 744, "name": "梅林", "regionName": "梅林", "count": 421},
{"id": 1055, "name": "上沙/下沙", "regionName": "上沙/下沙", "count": 291},
{"id": 2008, "name": "华强南", "regionName": "华强南", "count": 263},
{"id": 742, "name": "八卦岭/园岭", "regionName": "八卦岭/园岭", "count": 217},
{"id": 741, "name": "华强北", "regionName": "华强北", "count": 572},
{"id": 743, "name": "皇岗/水围", "regionName": "皇岗/水围", "count": 136},
{"id": 756, "name": "新城市广场", "regionName": "新城市广场", "count": 140},
{"id": 6595, "name": "车公庙", "regionName": "车公庙", "count": 305},
{"id": 6596, "name": "景田", "regionName": "景田", "count": 144},
{"id": 6597, "name": "新洲/石厦", "regionName": "新洲/石厦", "count": 374},
{"id": 6974, "name": "竹子林", "regionName": "竹子林", "count": 107},
{"id": 6975, "name": "市民中心", "regionName": "市民中心", "count": 39},
{"id": 7993, "name": "会展中心", "regionName": "会展中心", "count": 461},
{"id": 7994, "name": "岗厦", "regionName": "岗厦", "count": 110},
{"id": 7996, "name": "福田保税区", "regionName": "福田保税区", "count": 29}],
"29": [{"id": 29, "name": "全部", "regionName": "罗湖区", "count": 2191},
{"id": 6976, "name": "国贸", "regionName": "国贸", "count": 232},
{"id": 758, "name": "莲塘", "regionName": "莲塘", "count": 125},
{"id": 2009, "name": "笋岗", "regionName": "笋岗", "count": 159},
{"id": 748, "name": "翠竹路沿线", "regionName": "翠竹路沿线", "count": 42},
{"id": 745, "name": "东门", "regionName": "东门", "count": 484},
{"id": 746, "name": "宝安南路沿线", "regionName": "宝安南路沿线", "count": 67},
{"id": 757, "name": "火车站", "regionName": "火车站", "count": 96},
{"id": 6598, "name": "万象城", "regionName": "万象城", "count": 127},
{"id": 6599, "name": "喜荟城/水库", "regionName": "喜荟城/水库", "count": 99},
{"id": 7659, "name": "地王大厦", "regionName": "地王大厦", "count": 85},
{"id": 8469, "name": "黄贝岭", "regionName": "黄贝岭", "count": 136},
{"id": 8470, "name": "春风万佳/文锦渡", "regionName": "春风万佳/文锦渡", "count": 19},
{"id": 8471, "name": "布心/太白路", "regionName": "布心/太白路", "count": 154},
{"id": 8790, "name": "田贝/水贝", "regionName": "田贝/水贝", "count": 85},
{"id": 8794, "name": "银湖/泥岗", "regionName": "银湖/泥岗", "count": 37},
{"id": 8795, "name": "新秀/罗芳", "regionName": "新秀/罗芳", "count": 33},
{"id": 13080, "name": "梧桐山", "regionName": "梧桐山", "count": 34},
{"id": 14095, "name": "KK mall", "regionName": "KK mall", "count": 74}],
"30": [{"id": 30, "name": "全部", "regionName": "南山区", "count": 3905},
{"id": 751, "name": "南头", "regionName": "南头", "count": 325},
{"id": 750, "name": "华侨城", "regionName": "华侨城", "count": 126},
{"id": 749, "name": "蛇口", "regionName": "蛇口", "count": 9},
{"id": 1057, "name": "南油", "regionName": "南油", "count": 218},
{"id": 1058, "name": "科技园", "regionName": "科技园", "count": 460},
{"id": 1059, "name": "西丽", "regionName": "西丽", "count": 586},
{"id": 4811, "name": "南山中心区", "regionName": "南山中心区", "count": 635},
{"id": 6591, "name": "海岸城/保利", "regionName": "海岸城/保利", "count": 158},
{"id": 6592, "name": "前海", "regionName": "前海", "count": 32},
{"id": 6593, "name": "白石洲", "regionName": "白石洲", "count": 190},
{"id": 6594, "name": "欢乐海岸", "regionName": "欢乐海岸", "count": 22},
{"id": 7597, "name": "太古城", "regionName": "太古城", "count": 57},
{"id": 7599, "name": "花园城", "regionName": "花园城", "count": 42},
{"id": 13109, "name": "海上世界", "regionName": "海上世界", "count": 225},
{"id": 23117, "name": "世界之窗", "regionName": "世界之窗", "count": 97},
{"id": 25152, "name": "南山京基百纳", "regionName": "南山京基百纳", "count": 22},
{"id": 36635, "name": "深圳湾", "regionName": "深圳湾", "count": 17}],
"31": [{"id": 31, "name": "全部", "regionName": "盐田区", "count": 407},
{"id": 754, "name": "大小梅沙", "regionName": "大小梅沙", "count": 36},
{"id": 755, "name": "沙头角", "regionName": "沙头角", "count": 118},
{"id": 8789, "name": "东部华侨城", "regionName": "东部华侨城", "count": 11},
{"id": 8796, "name": "盐田海鲜食街", "regionName": "盐田海鲜食街", "count": 22},
{"id": 15349, "name": "壹海城", "regionName": "壹海城", "count": 51},
{"id": 38055, "name": "溪涌", "regionName": "溪涌", "count": ""}],
"32": [{"id": 32, "name": "全部", "regionName": "宝安区", "count": 6071},
{"id": 6587, "name": "西乡", "regionName": "西乡", "count": 15},
{"id": 6586, "name": "新安", "regionName": "新安", "count": 413},
{"id": 6585, "name": "石岩", "regionName": "石岩", "count": 466},
{"id": 752, "name": "宝安中心区", "regionName": "宝安中心区", "count": 458},
{"id": 4653, "name": "港隆城", "regionName": "港隆城", "count": 137},
{"id": 6588, "name": "沙井", "regionName": "沙井", "count": 824},
{"id": 6589, "name": "福永", "regionName": "福永", "count": 631},
{"id": 7684, "name": "松岗", "regionName": "松岗", "count": 435},
{"id": 7685, "name": "公明", "regionName": "公明", "count": 433},
{"id": 7719, "name": "海雅缤纷城", "regionName": "海雅缤纷城", "count": 125},
{"id": 7735, "name": "固戍", "regionName": "固戍", "count": 237},
{"id": 8006, "name": "桃源居", "regionName": "桃源居", "count": 25},
{"id": 14404, "name": "时代城", "regionName": "时代城", "count": 2},
{"id": 17088, "name": "罗田/燕川", "regionName": "罗田/燕川", "count": 45},
{"id": 17089, "name": "西田", "regionName": "西田", "count": 29},
{"id": 17091, "name": "圳美", "regionName": "圳美", "count": 32},
{"id": 17092, "name": "田寮/长圳", "regionName": "田寮/长圳", "count": 3},
{"id": 23524, "name": "沙井京基百纳", "regionName": "沙井京基百纳", "count": 98},
{"id": 27275, "name": "宝立方", "regionName": "宝立方", "count": 125},
{"id": 36634, "name": "宝安机场", "regionName": "宝安机场", "count": 244},
{"id": 37084, "name": "光明新区", "regionName": "光明新区", "count": 1}],
"33": [{"id": 33, "name": "全部", "regionName": "龙岗区", "count": 5193},
{"id": 753, "name": "罗岗/求水山", "regionName": "罗岗/求水山", "count": 145},
{"id": 6600, "name": "五和/民营市场", "regionName": "五和/民营市场", "count": 250},
{"id": 6601, "name": "平湖", "regionName": "平湖", "count": 356},
{"id": 7656, "name": "横岗", "regionName": "横岗", "count": 568},
{"id": 7658, "name": "南澳", "regionName": "南澳", "count": 32},
{"id": 7663, "name": "南联", "regionName": "南联", "count": 311},
{"id": 7664, "name": "坪地", "regionName": "坪地", "count": 131},
{"id": 8472, "name": "大运", "regionName": "大运", "count": 186},
{"id": 9013, "name": "李朗聚星商城", "regionName": "李朗聚星商城", "count": 63},
{"id": 13335, "name": "较场尾/大鹏所城", "regionName": "较场尾/大鹏所城", "count": 152},
{"id": 13358, "name": "水头", "regionName": "水头", "count": 20},
{"id": 13359, "name": "东涌", "regionName": "东涌", "count": 2},
{"id": 13361, "name": "万科广场/世贸", "regionName": "万科广场/世贸", "count": 107},
{"id": 13412, "name": "华南城/奥特莱斯", "regionName": "华南城/奥特莱斯", "count": 191},
{"id": 18069, "name": "大芬/南岭", "regionName": "大芬/南岭", "count": 359},
{"id": 18228, "name": "双龙", "regionName": "双龙", "count": 316},
{"id": 19456, "name": "慢城/三联", "regionName": "慢城/三联", "count": 111},
{"id": 19457, "name": "布吉街/东站/天虹", "regionName": "布吉街/东站/天虹", "count": 404},
{"id": 26297, "name": "天虹/坂田/杨美", "regionName": "天虹/坂田/杨美", "count": 344},
{"id": 26298, "name": "岗头/万科/雪象", "regionName": "岗头/万科/雪象", "count": 199},
{"id": 35919, "name": "华为坂田基地", "regionName": "华为坂田基地", "count": 9},
{"id": 36519, "name": "杨梅坑/桔钓沙", "regionName": "杨梅坑/桔钓沙", "count": 39},
{"id": 36520, "name": "葵涌", "regionName": "葵涌", "count": 37},
{"id": 36530, "name": "官湖", "regionName": "官湖", "count": 9},
{"id": 36531, "name": "西涌", "regionName": "西涌", "count": 49},
{"id": 36636, "name": "坪山高铁站", "regionName": "坪山高铁站", "count": 41},
{"id": 37501, "name": "龙岗中心城", "regionName": "龙岗中心城", "count": 365}],
"9553": [{"id": 9553, "name": "全部", "regionName": "龙华区", "count": 3080},
{"id": 1061, "name": "龙华", "regionName": "龙华", "count": 958},
{"id": 6584, "name": "民治", "regionName": "民治", "count": 164},
{"id": 7721, "name": "观澜", "regionName": "观澜", "count": 433},
{"id": 7722, "name": "大浪", "regionName": "大浪", "count": 398},
{"id": 9326, "name": "梅林关", "regionName": "梅林关", "count": 125},
{"id": 9327, "name": "锦绣江南", "regionName": "锦绣江南", "count": 33},
{"id": 36633, "name": "深圳北站", "regionName": "深圳北站", "count": 190},
{"id": 37723, "name": "龙华新区", "regionName": "龙华新区", "count": 14}],
"23420": [{"id": 23420, "name": "全部", "regionName": "坪山区", "count": 393},
{"id": 6602, "name": "坪山", "regionName": "坪山", "count": 232},
{"id": 23429, "name": "坑梓/竹坑", "regionName": "坑梓/竹坑", "count": 128},
{"id": 9535, "name": "南澳大鹏新区", "regionName": "南澳大鹏新区", "count": 91}]
}}
    datas = a['areaObj']
    b = datas.values()
    area_list = []
    for data in b:
        for d in data[1:]:
            area_list.append(d)  # keep every sub-area (the first entry, "全部", is skipped); each element is a dict
    l = 0
    old = time.time()
    for i in area_list:
        l += 1
        print('Scraping area %d:' % l, i['regionName'], 'shop count:', i['count'])
        try:
            crow_id(i['id'])
            now = time.time() - old
            print(i['name'], 'done!', 'elapsed: %d' % now)
        except Exception as e:
            print(e)
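The de-duplication step mentioned in this section is not shown in the post. A minimal sketch, assuming the four-column layout written by crow_id (name, cateName, poiid, ctPoi), could look like this. The helper names are hypothetical, and the idea that the de-duplicated output is the mt_id.csv file read by the main script is my assumption.

```python
import csv


def dedupe_rows(rows, key_index=2):
    """Keep the first row seen for each shop ID (column key_index)."""
    seen = set()
    unique = []
    for row in rows:
        if len(row) <= key_index:
            continue  # skip blank or malformed rows
        key = row[key_index]
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique


def dedupe_csv(src, dst, encoding="gb18030"):
    """Read src, drop rows with a repeated shop ID, and write the rest to dst."""
    with open(src, newline="", encoding=encoding) as f:
        rows = list(csv.reader(f))
    with open(dst, "w", newline="", encoding=encoding) as f:
        csv.writer(f).writerows(dedupe_rows(rows))


# Example call (file names assumed): dedupe_csv("meituan_id.csv", "mt_id.csv")
```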
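6. Scraping shop detail pages
The detail-page URL can now be built, so all that remains is to request it. It is a plain GET request, but it must carry a complete cookie. With an incomplete cookie a captcha appears very quickly. A cookie typically lasts about 1,000 requests before a captcha shows up, though sometimes only a few hundred. The requests Session did not seem to obtain the complete cookie, so this post uses selenium + Chrome to visit Meituan through a proxy IP, reads the cookie, and returns the cookie together with the IP for use in requests calls. In practice, once a captcha appears, changing only the IP while keeping the cookie also lets the crawl continue.
The code has two parts: the main program and a get_cookie file, which obtains the cookie and IP and also contains the detail-page parsing function. The cookie/IP function first fetches an IP (I bought proxies), then visits the Meituan Shenzhen home page and sleeps for a few seconds. This pause is important: it lets the page finish loading, otherwise some cookies are missing. It then visits the food page. Proxy quality varies, so it is best to test each IP before use. Here the test is the time taken to load the food page: 3 seconds or more is rejected and a new IP is fetched, under 3 seconds is accepted. Next it reads the cookies and checks whether they are complete. The __utma, __utmc and __utmz entries are sometimes missing, and without them a captcha appears quickly. A complete set normally has 18 cookies. The parsing function is simple: it returns a flag mark and the shop information info, and the flag indicates whether the fetch succeeded.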
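The completeness check can be sketched as a small function over the list of dicts that selenium's `get_cookies()` returns (each has `name` and `value` keys). This is an illustration under assumptions: `cookies_to_header` is a hypothetical name, and it simply joins every cookie it receives, whereas the module in this post assembles the string from a fixed list of names.

```python
REQUIRED = ("__utma", "__utmc", "__utmz")  # entries the post reports as sometimes missing


def cookies_to_header(cookies, required=REQUIRED, min_count=18):
    """Return a Cookie header string, or None if the cookie set looks incomplete."""
    jar = {c["name"]: c["value"] for c in cookies}
    if len(jar) < min_count or any(name not in jar for name in required):
        return None
    return "; ".join("{}={}".format(name, value) for name, value in jar.items())


# Tiny demonstration with made-up values and a lowered threshold.
demo = [{"name": "__utma", "value": "1"},
        {"name": "__utmc", "value": "2"},
        {"name": "__utmz", "value": "3"}]
print(cookies_to_header(demo, min_count=3))
print(cookies_to_header(demo[:2], min_count=2))
```

Checking for the specific names is stricter than checking the length alone, since a set of 18 cookies could still lack one of the three.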
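The main function uses multiple threads and is fairly simple: get an IP and cookie, then start crawling. The part that needs care is exception handling during the crawl. There are two main kinds of exception. The first is a timeout: sleep briefly (1.5 s in the code) and fetch once more. If it still fails, count the shop as failed, and after three consecutive failed shops get a new IP and cookie. The second is the error '由于目标计算机积极拒绝,无法连接' (the Windows message for a connection actively refused by the target machine), which means the requests were too frequent and the server has flagged us. In that case a new IP and cookie are needed as well.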
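The consecutive-failure rule can be sketched in isolation as a small counter. This is a simplified illustration rather than the exact control flow of the main script: `FailureTracker` is a hypothetical name, and the constants reuse the meaning of the mark values (0 success, 1 connection refused, 2 other error).

```python
OK, REFUSED, TIMEOUT = 0, 1, 2  # same meaning as the mark values in this post


class FailureTracker:
    """Count consecutive failed shops and signal when to refresh the IP and cookie."""

    def __init__(self, limit=3):
        self.limit = limit
        self.failures = 0

    def record(self, mark):
        """Record one outcome; return True when a new IP and cookie should be fetched."""
        if mark == OK:
            self.failures = 0
            return False
        if mark == REFUSED:
            self.failures = 0
            return True  # refused connections call for a new proxy right away
        self.failures += 1
        if self.failures >= self.limit:
            self.failures = 0
            return True
        return False


tracker = FailureTracker()
print([tracker.record(m) for m in (TIMEOUT, TIMEOUT, OK, TIMEOUT, TIMEOUT, TIMEOUT)])
```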
The get_cookie module:
from selenium import webdriver
import requests
import time
import json
from lxml import etree


# Return a proxy IP and its matching cookie (as a string); the IP is speed-tested first
def get_cookie():
    mark = 0
    while mark == 0:
        # URL of the purchased proxy-IP service
        p_url = 'XXXXXXXXXXXXX'
        r = requests.get(p_url)
        html = json.loads(r.text)
        a = html['data'][0]['ip']
        b = html['data'][0]['port']
        val = '--proxy-server=http://' + str(a) + ':' + str(b)
        val2 = 'https://' + str(a) + ':' + str(b)
        p = {'https': val2}
        print('Got IP:', p)
        chrome_options = webdriver.ChromeOptions()
        chrome_options.add_argument(val)
        # Raw string so the backslashes in the Windows path are taken literally
        driver = webdriver.Chrome(executable_path=r'C:\Program Files (x86)\Google\Chrome\Application\chromedriver.exe', chrome_options=chrome_options)
        driver.set_page_load_timeout(8)  # page-load timeout
        driver.set_script_timeout(8)
        url = 'https://i.meituan.com/shenzhen/'  # Meituan Shenzhen home page
        url2 = 'https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1'  # food page
        try:
            driver.get(url)
            time.sleep(2.5)
            c1 = driver.get_cookies()
            now = time.time()
            driver.get(url2)
            tt = time.time() - now
            print(tt)
            time.sleep(0.5)
            # Proxy speed test: reject the IP if the page took 3 s or more to open
            if tt < 3:
                c = driver.get_cookies()
                driver.quit()
                print('*******************')
                print(len(c1), len(c))
                # Check that the cookie set is complete; the normal length is 18
                if len(c) > 17:
                    # print(c)
                    x = {}
                    for line in c:
                        x[line['name']] = line['value']
                    # Join the cookies into one string for the header; built in two parts because it is long
                    co1 = '__mta=' + x['__mta'] + '; client-id=' + x['client-id'] + '; IJSESSIONID=' + x['IJSESSIONID'] + '; iuuid=' + x['iuuid'] + '; ci=30; cityname=%E6%B7%B1%E5%9C%B3; latlng=; webp=1; _lxsdk_cuid=' + x['_lxsdk_cuid'] + '; _lxsdk=' + x['_lxsdk']
                    co2 = '; __utma=' + x['__utma'] + '; __utmc=' + x['__utmc'] + '; __utmz=' + x['__utmz'] + '; __utmb=' + x['__utmb'] + '; i_extend=' + x['i_extend'] + '; uuid=' + x['uuid'] + '; _hc.v=' + x['_hc.v'] + '; _lxsdk_s=' + x['_lxsdk_s']
                    co = co1 + co2
                    print(co)
                    # Set the flag only once the string is built, so a missing cookie name retries instead of ending the loop
                    mark = 1
                    return (p, co)
                else:
                    print('Cookie incomplete, length:', len(c))
            else:
                print('Timed out')
                driver.quit()
                time.sleep(3)
        except Exception:
            driver.quit()


# Parse a shop detail page; return a flag mark and the shop information info
# u holds the url and the shop category, pc holds the proxy and the cookie, m is the number scraped so far,
# n is the thread number, ll is the number of shops left, ttt is the total time this thread has been running
def parse(u, pc, m, n, ll, ttt):
    mesg = 'Thread:' + str(n) + ' No:' + str(m) + ' Time:' + str(ttt) + ' left:' + str(ll)  # progress info for this thread
    url = u[0]
    cate = u[1]
    p = pc[0]
    cookie = pc[1]
    mark = 0  # flag: 0 means success, 1 and 2 are the two kinds of error
    head = {'Host': 'meishi.meituan.com',
            'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
            'Cache-Control': 'max-age=0',
            'Connection': 'keep-alive',
            'Upgrade-Insecure-Requests': '1',
            'Referer': 'https://meishi.meituan.com/i/?ci=30&stid_b=1&cevent=imt%2Fhomepage%2Fcategory1%2F1',
            'User-Agent': 'Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.139 Mobile Safari/537.36',
            'Cookie': cookie
            }
    info = []  # shop information
    try:
        r = requests.get(url, headers=head, timeout=3, proxies=p)
        r.encoding = 'utf-8'
        html = etree.HTML(r.text)
        datas = html.xpath('body/script[@crossorigin="anonymous"]')
        for data in datas:
            try:
                strs = data.text[:16]
                if strs == 'window._appState':
                    # Drop the leading "window._appState = " (19 characters) and the trailing ";"
                    result = data.text[19:-1]
                    result = json.loads(result)
                    name = result['poiInfo']['name']
                    addr = result['poiInfo']['addr']
                    phone = result['poiInfo']['phone']
                    aveprice = result['poiInfo']['avgPrice']
                    opentime = result['poiInfo']['openInfo']
                    opentime = opentime.replace('\n', ' ')
                    avescore = result['poiInfo']['avgScore']
                    marknum = result['poiInfo']['MarkNumbers']
                    lng = result['poiInfo']['lng']
                    lat = result['poiInfo']['lat']
                    info = [name, cate, addr, phone, aveprice, opentime, avescore, marknum, lng, lat]
                    print(url)
                    print(mesg, name, cate, addr, phone, aveprice, opentime, avescore, marknum, lng, lat)
            except Exception:
                # Script tags without text or without the expected fields are skipped
                pass
    except Exception as e:
        print('Error Thread:', n)  # print the number of the failing thread
        print(e)
        # Windows (Chinese locale) message for a refused connection; a substring test replaces the original fixed slice
        if '由于目标计算机积极拒绝,无法连接' in str(e):
            print('由于目标计算机积极拒绝,无法连接', n)
            mark = 1  # type 1 error: the IP has to be replaced
        else:
            mark = 2  # type 2 error: fetch once more
    return (mark, info)  # return the flag and the shop information
The main module:
import csv
import time
import threading
from get_cookie import get_cookie
from get_cookie import parse

lock = threading.Lock()  # one lock shared by all threads, so writes to the CSV do not interleave


def crow(n, l):  # n is the thread number, l is the shared list of urls
    sym = 0  # number of consecutive failed shops
    pc = get_cookie()  # get an IP and a cookie
    m = 0  # number of shops processed by this thread
    now = time.time()
    while True:
        try:
            u = l.pop(0)
        except IndexError:  # the list is empty (another thread may have taken the last item)
            print('&&&& thread %d finished' % n)
            break
        ll = len(l)
        m += 1
        ttt = time.time() - now
        result = parse(u, pc, m, n, ll, ttt)
        mark = result[0]
        info = result[1]
        if mark == 2:
            time.sleep(1.5)
            result = parse(u, pc, m, n, ll, ttt)
            mark = result[0]
            info = result[1]
            if mark != 0:
                sym += 1
        if mark == 1:
            pc = get_cookie()
            result = parse(u, pc, m, n, ll, ttt)
            mark = result[0]
            info = result[1]
            if mark != 0:
                sym += 1
        if mark == 0:  # fetched successfully
            sym = 0
            with lock:
                with open('meituan.csv', 'a', newline='', encoding='gb18030') as f:
                    write = csv.writer(f)
                    write.writerow(info)
        if sym > 2:  # three consecutive failures: get a new ip and cookie
            sym = 0
            pc = get_cookie()


if __name__ == '__main__':
    url_list = []
    # mt_id.csv is presumably the de-duplicated copy of meituan_id.csv; the post does not show that step
    with open('mt_id.csv', 'r', encoding='gb18030') as f:
        read = csv.reader(f)
        for line in read:
            d_list = ['', '']
            url = 'https://meishi.meituan.com/i/poi/' + str(line[2]) + '?ct_poi=' + str(line[3])
            d_list[0] = url
            d_list[1] = line[1]
            url_list.append(d_list)
    th_list = []
    for i in range(1, 6):
        t = threading.Thread(target=crow, args=(i, url_list,))
        print('***** starting thread %d...' % i)
        t.start()
        th_list.append(t)
        time.sleep(30)  # stagger the threads; each one launches Chrome to get its own cookie
    for t in th_list:
        t.join()
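An alternative way to hand URLs to the worker threads is the standard library's `queue.Queue`, which takes care of locking around "is anything left, then take it". The sketch below is not the code used for this crawl. `run_workers` and `handle` are hypothetical names, and the demonstration squares numbers instead of fetching pages.

```python
import queue
import threading


def run_workers(items, handle, num_threads=5):
    """Process every item with handle() on num_threads threads; return the results."""
    tasks = queue.Queue()
    for item in items:
        tasks.put(item)
    results = []
    lock = threading.Lock()  # protects the shared results list

    def worker():
        while True:
            try:
                item = tasks.get_nowait()
            except queue.Empty:
                return  # nothing left to do
            out = handle(item)
            with lock:
                results.append(out)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


print(sorted(run_workers(range(10), lambda x: x * x)))
```

In a real crawl, `handle` would wrap the fetch-and-retry logic for one shop and the results would be written to the CSV instead of collected in a list.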
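7. Results
With 5 threads the crawl should finish in about an hour. The final total was 21,828 records, fewer than 50 short of the 21,872 IDs collected.
My skills are limited, so corrections are welcome. If you know how to scrape the desktop version, please let me know. Thanks.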