[Crawler] Scraping all LOL hero images and skin images (using Python 2)

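Analysis:

URL: https://lol.qq.com/data/info-heros.shtml

Before analyzing the requests, I planned to use XPath to extract each hero's link and then request each page to get the skin images and skin names. That approach works and is not difficult.

After analyzing the requests, I found a JS request that contains the information for all heroes, from which each hero's id and name can be extracted: https://lol.qq.com/biz/hero/champion.js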
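As a minimal offline sketch of that extraction, the snippet below pulls the id-to-name mapping out of a champion.js-style response. The sample text is made up for illustration (only the `LOLherojs.champion=` prefix and the `keys` field come from this article's code), and `extract_hero_keys` is a hypothetical helper name. It should run on Python 2.7 or Python 3.

```python
import json
import re

# Hypothetical, shortened stand-in for the body of champion.js
SAMPLE_CHAMPION_JS = 'LOLherojs.champion={"keys": {"133": "Quinn", "91": "Talon"}};'


def extract_hero_keys(js_text):
    """Return the id -> name mapping from a champion.js-style assignment."""
    # Capture the JSON object between "LOLherojs.champion=" and the first ";"
    match = re.search(r'LOLherojs\.champion=(.+?);', js_text)
    if match is None:
        raise ValueError("champion assignment not found")
    return json.loads(match.group(1))["keys"]


keys = extract_hero_keys(SAMPLE_CHAMPION_JS)
for hero_id in sorted(keys):
    print("%s -> %s" % (hero_id, keys[hero_id]))
```

Cutting at the first `;` only works as long as no value inside the JSON contains a semicolon, which is an assumption about the real response.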

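Each hero's detail page also issues a JS request that contains the hero's skin count and skin names, for example https://lol.qq.com/biz/hero/Galio.js, where Galio is the hero name obtained from the first JS request.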
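To illustrate, here is a sketch of reading the skin list from a per-hero response without network access. The sample payload is an assumption: only the `"data"` object with its `name` and `skins` fields (each skin having `id` and `name`) is taken from this article's code, while the prefix and the values are invented. Instead of cutting the text at the first `;`, this sketch swaps in `json.JSONDecoder.raw_decode`, which parses one JSON value and ignores whatever follows it. `parse_hero_skins` is a hypothetical name.

```python
import json

# Hypothetical per-hero payload; the ids and skin names are invented
SAMPLE_HERO_JS = (
    'LOLherojs.champion.Galio={"data": {"name": "Galio", "skins": ['
    '{"id": "3000", "name": "default"}, '
    '{"id": "3001", "name": "Skin; with semicolon"}]}};'
)


def parse_hero_skins(js_text):
    """Return (hero_name, [(skin_id, skin_name), ...]) from a hero JS payload."""
    marker = '"data":'
    start = js_text.index(marker) + len(marker)
    # raw_decode expects the value to start exactly at the given index
    while js_text[start].isspace():
        start += 1
    data, _end = json.JSONDecoder().raw_decode(js_text, start)
    skins = [(skin["id"], skin["name"]) for skin in data["skins"]]
    return data["name"], skins


hero_name, skins = parse_hero_skins(SAMPLE_HERO_JS)
print("%s has %d skins" % (hero_name, len(skins)))
```

Because the decoder stops at the end of the JSON object, a semicolon inside a skin name does not truncate the data.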

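Crawl plan:

1. Get the id and name of every hero from the first JS request

2. Build the list of per-hero JS request URLs: heros_url = []

3. Iterate over the list and get each hero's skin names and image URLs

4. Download the images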
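Steps 2 and 3 are mostly string composition, which can be sketched without touching the network. The two URL patterns are the ones used in this article's code; `build_hero_urls` and `skin_image_url` are hypothetical helper names, and the mapping with Quinn and Talon is only sample data.

```python
HERO_JS_TEMPLATE = "https://lol.qq.com/biz/hero/{name}.js"
SKIN_IMG_TEMPLATE = (
    "http://ossweb-img.qq.com/images/lol/web201310/skin/big{skin_id}.jpg"
)


def build_hero_urls(keys):
    """Step 2: one JS request URL per hero, sorted by name for a stable order."""
    return [HERO_JS_TEMPLATE.format(name=name) for name in sorted(keys.values())]


def skin_image_url(skin_id):
    """Step 3: the image URL for one skin id."""
    return SKIN_IMG_TEMPLATE.format(skin_id=skin_id)


heros_url = build_hero_urls({"133": "Quinn", "91": "Talon"})
for url in heros_url:
    print(url)
print(skin_image_url("28006"))
```

Keeping the URL building separate from the downloading makes each step easy to check on its own before any request is sent.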

The code is as follows:

# coding: utf-8

import urllib2
import os
import json, re

def get_heros_json():
	''' Fetch champion.js and extract the id -> name mapping of all heroes, e.g.
	{u'133': u'Quinn', u'91': u'Talon'} '''
	url = "http://lol.qq.com/biz/hero/champion.js"
	response = urllib2.urlopen(url=url)
	html = response.read()
	# The response is a JS assignment; capture the JSON after "LOLherojs.champion="
	html_json = re.findall(r'LOLherojs\.champion=(.+?);', html)
	heros_json = json.loads(html_json[0])['keys']
	# Hand the mapping to the next step, which builds the per-hero URLs
	get_heros_url(heros_json)

def get_heros_url(heros_json):
	''' Build the JS request URL of every hero from the id -> name mapping '''
	# Holds the request URL of each hero
	heros_url = []
	for key in heros_json:
		hero_url = "https://lol.qq.com/biz/hero/" + heros_json[key] + ".js"
		heros_url.append(hero_url)
	get_hero_info(heros_url)


def get_hero_info(heros_url):
	''' Create the output folder, then process every hero URL '''
	# Windows-style path
	# Create the folder if it does not exist yet
	save_dir = '.\\heros\\'
	if not os.path.exists(save_dir):
		os.makedirs(save_dir)

	for hero in heros_url:
		get_hero(hero)
	print("Download complete!")


def get_hero(hero):
	''' Fetch one hero's JS request and download its skins '''
	response = urllib2.urlopen(url=hero)
	html = response.read()
	# Capture everything after "data": up to the first ";"
	html_json = re.findall(r"\"data\":(.+?);", html)
	# Rebuild complete JSON (the captured text lacks the leading {"data": )
	html_json = "{\"data\":" + html_json[0]

	# Parse the JSON into Python objects
	hero_json = json.loads(html_json)
	# The hero's own name, used as the default image name
	default_name = hero_json["data"]["name"]

	get_download(hero_json, default_name)
	
def get_download(hero_json, default_name):
	''' Download every skin image of one hero into the heros folder '''
	# List of the hero's skins
	hero_skinsjson = hero_json["data"]['skins']

	for i, skin in enumerate(hero_skinsjson):
		imgId = skin['id']
		# Save the first skin under the hero's name instead of its skin name
		imgName = default_name if i == 0 else skin['name']
		# "/" cannot appear in a file name (e.g. "K/DA ..."), so strip it
		imgName = imgName.replace("/", '')
		# Decode byte strings only; json.loads normally returns unicode already
		if isinstance(imgName, str):
			imgName = imgName.decode("utf-8")
		save_dir = ".\\heros\\"
		save_file_name = save_dir + imgName + ".jpg"
		url = "http://ossweb-img.qq.com/images/lol/web201310/skin/big" + imgId + ".jpg"
		try:
			# Skip images that were already downloaded
			if not os.path.exists(save_file_name):
				content = urllib2.urlopen(url=url).read()
				with open(save_file_name, "wb") as f:
					f.write(content)
		except Exception:
			print("Download failed: " + url + "  name is " + imgName)

def main():
	# get_heros_json() starts the whole chain: fetch ids, build URLs, download images
	get_heros_json()


if __name__ == '__main__':
	main()


'''
Download failed: http://ossweb-img.qq.com/images/lol/web201310/skin/big28006.jpg  name is K/DA 伊芙琳
Download failed: http://ossweb-img.qq.com/images/lol/web201310/skin/big84009.jpg  name is K/DA 阿卡丽
Download failed: http://ossweb-img.qq.com/images/lol/web201310/skin/big103015.jpg  name is K/DA 阿狸
Download failed: http://ossweb-img.qq.com/images/lol/web201310/skin/big145014.jpg  name is K/DA 卡莎
Download failed: http://ossweb-img.qq.com/images/lol/web201310/skin/big145015.jpg  name is K/DA 卡莎 至臻
Download complete!
These failures come from the file name: the names contain '/', and removing the '/' alone was not enough because the name then came out garbled due to its encoding.

Fix: strip the '/' and decode the name as UTF-8: imgName = imgName.replace("/", '').decode("utf-8")
'''
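A slightly more general way to handle such names, sketched here as an assumption rather than what the script above does, is to strip every character that Windows forbids in file names, not only '/'. `safe_file_name` and `image_path` are hypothetical helpers; the unicode literals keep the sketch usable on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
import os
import re

# Characters that Windows does not allow in file names: \ / : * ? " < > |
_FORBIDDEN = re.compile(u'[\\\\/:*?"<>|]')


def safe_file_name(name):
    """Remove characters that are invalid in Windows file names."""
    return _FORBIDDEN.sub(u"", name).strip()


def image_path(save_dir, skin_name):
    """Build the target .jpg path for a skin inside save_dir."""
    return os.path.join(save_dir, safe_file_name(skin_name) + u".jpg")


print(image_path(u"heros", u"K/DA 阿狸"))
```

Using `os.path.join` instead of a hard-coded `.\\heros\\` prefix also lets the same code build valid paths outside Windows.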

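The code builds on someone else's approach plus my own thinking. It is for study and reference only; please credit the source when reposting.

Reference: https://blog.csdn.net/teak_on_my_way/article/details/81321509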
