Use Python to crawl Honor of Kings skin images (a basic crawler exercise — Python crawler tutorial, super detailed + complete code)

Foreword

Today we will use Python to download the skin images of every hero in Honor of Kings and save them locally. The complete code is at the end of the article.

1. Analysis of the approach

Website analysis

The official website lists many heroes, and each hero has its own page; the pages' URLs differ only in a number (go take a look). The skin images we want are on those pages. At first I guessed the hero page URLs might follow some pattern we could construct directly, but analysis showed they don't, so we have no choice but to scrape every hero page URL from the official site and request each one (a small number of them can't be crawled this way, presumably because they are loaded asynchronously) to get the skin images.
When we request a hero page, the data we crawl differs from what the rendered page shows.
So I captured the network traffic and found the address of the skin images. Analysis showed that 518 (each hero page has its own specific number) identifies a particular hero, 1 is the first skin image, 2 the second, 3 the third, and so on (the other heroes work the same way).
Then we only need to build the skin image addresses and request them to get the pictures. But a new problem arises: how do we know the number of skins? Heroes have varying numbers of skins; there is no fixed count. In fact, the skin names correspond one-to-one, in order, with the skin images: skin name 1 ↔ image 1, skin name 2 ↔ image 2, skin name 3 ↔ image 3, and so on. So from the list of skin names we can learn how many images there are, and at the same time get a name for each image — two birds with one stone. Now for the code.
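The name-to-count idea can be sketched on its own. The skin names come from a `data-imgname` attribute whose format (`name&digits|name&digits`, inferred from the cleanup code later in the article) separates names with `|` and tags each one with `&` plus digits; the sample string in the usage below is hypothetical:

```python
def parse_skin_names(data_imgname):
    """Split a data-imgname string into skin names.

    Names are separated by '|'; each carries a trailing '&' plus digits
    that must be stripped. Note the quirk: any digits inside a skin
    name would be stripped too.
    """
    names = []
    for part in data_imgname.split('|'):
        cleaned = part.replace('&', '')
        for digit in '0123456789':
            cleaned = cleaned.replace(digit, '')
        names.append(cleaned)
    return names
```

The length of the returned list is exactly the number of skin image addresses to build.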

Code outline

Crawl the URL of each hero from the official site, then request the pages one by one to get the number of skin images (without the count you don't know how many skin image addresses to build). After building the skin image addresses, request each one, fetch the image, and save it. That's the whole plan.
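The outline above boils down to a short data flow. Here is a minimal sketch with the I/O injected as callables so the shape is visible (the hero data and function names are hypothetical; section 3 has the full implementation):

```python
def crawl_pipeline(hero_list, fetch, save):
    """Walk the crawler's data flow.

    hero_list: [(hero_name, hero_id, [skin names]), ...], assumed to
    have been scraped from the hero list and the hero pages.
    fetch(url) -> bytes downloads one image; save(dir_name, file_name,
    data) writes it to disk.
    """
    base = 'https://game.gtimg.cn/images/yxzj/img201606/skin/hero-info'
    for hero_name, hero_id, skins in hero_list:
        # Image indices start at 1 and follow the skin-name order
        for i, skin in enumerate(skins, start=1):
            url = f'{base}/{hero_id}/{hero_id}-bigskin-{i}.jpg'
            save(hero_name, skin, fetch(url))
```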

2. Environment configuration

# python 3.8.8
# requests 2.25.1
# parsel   1.6.0

# Installation
# pip install requests==2.25.1
# pip install parsel==1.6.0

# If installation is slow, use a mirror index
# pip install -i https://pypi.tuna.tsinghua.edu.cn/simple requests==2.25.1
# pip install -i https://pypi.tuna.tsinghua.edu.cn/simple parsel==1.6.0

3. Complete code

import requests
import parsel
import os  # standard library, no install needed
import re  # standard library, no install needed
import logging  # standard library, no install needed


# Send a request
def get_url(url):
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'accept-language': 'zh-CN,zh;q=0.9',
        'pragma': 'no-cache',
        'referer': 'https://pvp.qq.com/',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36'}

    response = requests.get(url=url, headers=headers)
    # Use the detected encoding to avoid garbled text
    response.encoding = response.apparent_encoding
    return response


# Parse data
def parser_data(html_data, css_rule, css_1, css_2):
    # Build a selector from the HTML text
    html = parsel.Selector(html_data)

    # First extract the list items, then fields from each item
    li = html.css(css_rule)

    data_list = []
    for i in li:
        # Hero name (or skin name)
        name = i.css(css_1).get()
        # Page address (or image address)
        url_data = i.css(css_2).get()
        data_list.append([name, url_data])
    # Returns [[name, url], [name, url], ...]
    return data_list


def first_data(url):
    html_1 = get_url(url).text
    data_list = parser_data(html_1, '.herolist li', 'a::text', 'a::attr(href)')
    u = []
    for k in data_list:
        # Extract the number from the URL; used later to build image addresses
        num = re.findall(r'\d+', k[1])[0]
        # Build each hero's page address
        url_ju = 'https://pvp.qq.com/web201605/' + k[1]
        # Replace the original relative address
        k[1] = url_ju
        k.append(num)
        u.append(k)
        print(k)
    # Returns [[name, url, num], [name, url, num], ...]
    return u


def second_data(url_list):
    total = []

    for g in url_list:
        html_2 = get_url(g[1]).text
        # The skin names can no longer be scraped as text here, so the
        # first field comes back as None; the names are in data-imgname
        data = parser_data(html_2, '.pic-pf-list ', 'li p::text', '::attr(data-imgname)')
        # Drop the None values
        for y in data:
            y.remove(None)
        # Insert the hero name and id number
        data[0].insert(0, g[0])
        data[0].insert(1, g[2])
        total.append(data)
        print(data)
    # Returns [[...], [...], ...]
    return total


def save_file(data, dir_name, file_name):
    '''

    :param data: image bytes
    :param dir_name: directory name for the hero
    :param file_name: file name for the saved image
    :return:
    '''
    # Note: the original checked for '王者图片' but created '荣耀图片';
    # the existence check and mkdir must use the same directory
    if not os.path.exists('荣耀图片'):
        os.mkdir('荣耀图片')

    hero_dir = os.path.join('荣耀图片', dir_name)
    if not os.path.exists(hero_dir):
        os.mkdir(hero_dir)

    with open(os.path.join(hero_dir, file_name + '.jpg'), mode='wb') as f:
        f.write(data)


# Strip special characters from the skin-name string;
# returns a tuple ([skin names], skin count)
def del_str(data_str):
    name_list = []
    jpg_name = data_str.split('|')
    for h in jpg_name:
        t = h.replace('&', '')
        for number in range(10):
            t = t.replace(str(number), '')
        name_list.append(t)
    return name_list, len(name_list)


if __name__ == '__main__':
    # Configure the log output format
    LOG_FORMAT = "time: %(asctime)s - level: %(levelname)s - message: %(message)s"
    # Configure the logger: level and format
    logging.basicConfig(level=logging.INFO, format=LOG_FORMAT)

    # Official website address goes here!!!
    url = ''
    d = first_data(url)

    total_list = second_data(d)
    logging.info(f'{len(total_list)} heroes in total')
    logging.info('Starting to crawl images')
    a = 1
    # Walk through every hero record
    for n in total_list:
        # Number the hero directories
        dir_name = str(a) + '.' + n[0][0]
        logging.info(f'Crawling skins for {dir_name}')
        # The hero-specific id
        num_id = n[0][1]
        # Pass in the skin-name string; returns ([skin names], skin count)
        name_num = del_str(n[0][-1])
        a += 1
        # Build each image address and fetch it; skin indices start at 1
        for j in range(1, name_num[1] + 1):
            logging.info(f'Crawling {name_num[0][j - 1]}')
            jpg_url = f'https://game.gtimg.cn/images/yxzj/img201606/skin/hero-info/{num_id}/{num_id}-bigskin-{j}.jpg'
            # Fetch the skin image
            jpg_content = get_url(jpg_url).content
            # Save the image
            save_file(jpg_content, dir_name, name_num[0][j - 1])
            logging.info(f'Saved {name_num[0][j - 1]}')
    logging.info(f'Finished crawling {len(total_list)} heroes!')



4. Summary

Crawling skin images is fairly basic and makes good practice for beginners. If you crawl too many images you will get blocked, and some images become impossible to download. In my experience everything works at first, but after many requests some downloads start failing. You can try switching IPs, adding a delay between requests, or crawling with selenium instead.
If there are any errors in the article, please point them out. Thank you!
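The delay idea can be sketched as a wrapper around the request. The delay values below are illustrative, not tuned, and the `get` parameter is injectable so the sketch can be exercised without touching the network:

```python
import random
import time

import requests


def polite_get(url, get=requests.get, retries=3, base_delay=1.0):
    """GET with a random pause before each attempt, plus simple retries."""
    for _ in range(retries):
        # Pause so requests are not fired back-to-back
        time.sleep(base_delay * (1 + random.random()))
        try:
            resp = get(url, timeout=10)
            if resp.status_code == 200:
                return resp
        except requests.RequestException:
            pass  # transient error: wait and retry
    return None
```

Replacing the bare `requests.get` call in `get_url` with something like this should make long runs noticeably less likely to be cut off.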

Origin: blog.csdn.net/qq_65898266/article/details/124870582