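Analysis:
URL: https://lol.qq.com/data/info-heros.shtml
Before inspecting the network requests, my plan was to use XPath to extract each hero's link and then request each page to get the skin images and skin names. That approach works and is not hard.
After inspecting the requests, I found that one JS request contains the information for every hero, from which the hero id and name can be read: https://lol.qq.com/biz/hero/champion.js
Each hero's info page also issues a JS request that contains that hero's skin count and skin names. For example, in https://lol.qq.com/biz/hero/Galio.js, Galio is the hero name obtained from the first JS file.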
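To illustrate how the hero list can be read out of such a JS file, here is a minimal sketch in Python 3 (an assumption; the script in this post is Python 2). The `sample` string is made up for the example and only mimics the `LOLherojs.champion=` assignment that the script's regular expression looks for; the real file may contain more than this. The function name `extract_hero_keys` is hypothetical. Instead of a regular expression, the sketch uses `json.JSONDecoder().raw_decode`, which parses one JSON value starting at a given index and ignores whatever follows it.

```python
import json

# Text that precedes the JSON payload, taken from the script's regex.
MARKER = "LOLherojs.champion="


def extract_hero_keys(js_text):
    """Return the id -> name mapping stored under 'keys' in the JS text."""
    start = js_text.index(MARKER) + len(MARKER)
    # raw_decode parses a single JSON value beginning at `start` and
    # returns (value, end_index); the trailing ';' is simply left unread.
    payload, _end = json.JSONDecoder().raw_decode(js_text, start)
    return payload["keys"]


# Made-up sample that only imitates the shape of champion.js.
sample = 'LOLherojs.champion={"keys":{"133":"Quinn","91":"Talon"}};'
hero_keys = extract_hero_keys(sample)
```

Parsing from a known start index avoids depending on where the first `;` happens to appear in the payload.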
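Crawling approach:
1. Get the id and name of every hero from the first JS request
2. Build the list of per-hero JS request URLs: heros_url = []
3. Iterate over the list and get each hero's skin names and image URLs
4. Download the images
The code is as follows: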
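Steps 2 and 3 are mostly string building, so they can be sketched without any network access. The example below is a Python 3 sketch (an assumption; the full script in this post is Python 2). The URL prefixes are the ones used in the script, while the helper names `build_hero_urls` and `skin_image_url` are hypothetical and exist only for illustration.

```python
HERO_JS_PREFIX = "https://lol.qq.com/biz/hero/"
SKIN_IMG_PREFIX = "http://ossweb-img.qq.com/images/lol/web201310/skin/big"


def build_hero_urls(hero_keys):
    """Step 2: turn the id -> name mapping into a list of per-hero JS URLs."""
    # Sort by numeric id so the order is stable from run to run.
    ordered = sorted(hero_keys.items(), key=lambda item: int(item[0]))
    return [HERO_JS_PREFIX + name + ".js" for _hero_id, name in ordered]


def skin_image_url(skin_id):
    """Step 3: build the image URL for one skin id."""
    return SKIN_IMG_PREFIX + str(skin_id) + ".jpg"


hero_urls = build_hero_urls({"133": "Quinn", "91": "Talon"})
first_skin = skin_image_url("28006")
```

Keeping the URL construction in small pure functions makes it easy to check the results before any request is sent.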
# coding: utf-8
# Python 2 script (urllib2 only exists in Python 2)
import urllib2
import os
import json
import re
def get_heros_json():
    ''' Fetch the information for all heroes from the URL. The 'keys' mapping looks like:
    {u'133': u'Quinn', u'91': u'Talon'} '''
    url = "http://lol.qq.com/biz/hero/champion.js"
    response = urllib2.urlopen(url=url)
    html = response.read()  # .decode("utf-8")
    # Pull out the JSON object assigned to LOLherojs.champion
    html_json = re.findall(r'LOLherojs\.champion=(.+?);', html)
    heros_json = json.loads(html_json[0])['keys']
    # print heros_json
    get_heros_url(heros_json)
    # return heros_url
def get_heros_url(heros_json):
    ''' Iterate over the given mapping and build each hero's JS request URL '''
    # Holds the request URL of every hero
    heros_url = []
    for key in heros_json:
        # print ("hero is %s; value is %s " % (key, heros_json[key]))
        hero_url = "https://lol.qq.com/biz/hero/" + heros_json[key] + ".js"
        heros_url.append(hero_url)
    # return heros_url
    get_hero_info(heros_url)
def get_hero_info(heros_url):
    # Windows-style path
    # Check whether the folder exists; create it if it does not
    save_dir = '.\\heros\\'
    if not os.path.exists(save_dir):
        os.makedirs(save_dir)
    for hero in heros_url:
        get_hero(hero)
    print ("下载完成!")  # "Download complete!"
    # heros_url="https://lol.qq.com/biz/hero/Galio.js"
    # get_hero(heros_url)
def get_hero(hero):
    ''' Fetch the JS response for a single hero '''
    # print hero
    response = urllib2.urlopen(url=hero)
    html = response.read()
    html_json = re.findall(r"\"data\":(.+?);", html)
    # Rebuild a complete JSON document (the match lacks the leading {"data": )
    html_json = "{\"data\":" + html_json[0]
    # print html_json
    # Convert the JSON into Python objects
    hero_json = json.loads(html_json)
    # Default name: the hero's own name
    default_name = hero_json["data"]["name"]
    # print default_name
    get_download(hero_json, default_name)
def get_download(hero_json, default_name):
    ''' Download every skin image of one hero '''
    # List of the hero's skins
    hero_skinsjson = hero_json["data"]['skins']
    # print hero_skinsjson
    save_dir = ".\\heros\\"
    for i, skin in enumerate(hero_skinsjson):
        imgId = skin['id']
        # The first entry is saved under the hero's name, the rest under their skin names
        imgName = default_name if i == 0 else skin['name']
        imgName = imgName.replace("/", '').decode("utf-8")
        save_file_name = save_dir + imgName + ".jpg"
        url = "http://ossweb-img.qq.com/images/lol/web201310/skin/big" + imgId + ".jpg"
        # print url
        try:
            # Skip files that were already downloaded
            if not os.path.exists(save_file_name):
                content = urllib2.urlopen(url=url).read()
                with open(save_file_name, "wb") as f:
                    f.write(content)
        except Exception:
            # "下载失败" means "download failed"
            print("下载失败"+ url + " name is " + imgName)
def main():
    # get_heros_json() drives the whole chain and returns nothing
    get_heros_json()

if __name__ == '__main__':
    main()
'''
下载失败http://ossweb-img.qq.com/images/lol/web201310/skin/big28006.jpg name is K/DA 伊芙琳
下载失败http://ossweb-img.qq.com/images/lol/web201310/skin/big84009.jpg name is K/DA 阿卡丽
下载失败http://ossweb-img.qq.com/images/lol/web201310/skin/big103015.jpg name is K/DA 阿狸
下载失败http://ossweb-img.qq.com/images/lol/web201310/skin/big145014.jpg name is K/DA 卡莎
下载失败http://ossweb-img.qq.com/images/lol/web201310/skin/big145015.jpg name is K/DA 卡莎 至臻
下载完成!
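The failures above were caused by the encoding of the file name; removing '/' alone was not enough, because the name still came out garbled
Fix: imgName = imgName.replace("/", '').decode("utf-8")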
'''
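The failed names above all contain '/', which cannot appear in a file name. As a rough sketch of a more general cleanup, the Python 3 example below (an assumption; the script itself is Python 2) strips every character that Windows rejects in file names. The helper name `safe_file_name` is hypothetical. In Python 3, `str` is already Unicode, so no `.decode("utf-8")` call is needed there.

```python
import re

# Characters that Windows does not allow in file names.
INVALID_CHARS = re.compile(r'[\\/:*?"<>|]')


def safe_file_name(name):
    """Remove characters that are invalid in Windows file names."""
    return INVALID_CHARS.sub("", name).strip()


cleaned = safe_file_name("K/DA 阿狸")
```

With this helper, the save path could be built as `save_dir + safe_file_name(imgName) + ".jpg"`.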
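This code builds on someone else's approach plus my own thinking. It is for learning and reference only; please credit the source when reposting.
Reference: https://blog.csdn.net/teak_on_my_way/article/details/81321509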