World of Tanks WOT Knowledge Map Trilogy - Reptile Chapter

About World of Tanks

  "World of Tanks" (WOT) is an online war game, developed by Wargaming, that I played during my undergraduate studies. It was first launched in Russia on October 30, 2010, in North America and Europe on April 12, 2011, and in China on March 15, 2011 through Kongzhong.com (since 2020 the Chinese server has been operated by 360). The game is set around World War II: players drive tanks from the 1930s through the 1960s into battle, which demands strategy and cooperation, and the in-game tanks are modeled with a high degree of historical fidelity.

  World of Tanks official website: https://wotgame.cn/
  World of Tanks Tank Encyclopedia: https://wotgame.cn/zh-cn/tankopedia/#wot&w_m=tanks


1. Crawler task

  WOT currently has five tank types and 11 nations. We are going to build a knowledge graph from the tank encyclopedia, so we will first use a crawler to obtain detailed information about every tank: tier, firepower, mobility, protection, spotting ability, and so on. Taking the current tier-VIII powerhouse, the Chinese heavy tank BZ-176, as an example, its detail page looks like this:


2. Get tank list

  As a routine first step, press F12 and refresh with F5 to inspect the page's network traffic and locate the specific request that returns the tank list:

  It is a POST request that returns JSON data containing basic information about each tank of the requested type:
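The list parser later in this section indexes each row of the returned data['data'] array by position. As a self-contained illustration, here is how one hypothetical row maps onto a tank record (the field order is inferred from the parser code below; the real response carries more fields):

```python
import json

# Hypothetical response body; the real API returns many rows with this field order:
# [nation, type, ?, tier, name, short_name, url_slug, id, ...]
resp_text = json.dumps({"data": {"data": [
    ["china", "heavyTank", "", "VIII", "BZ-176", "BZ-176", "Ch47_BZ_176", 60209]
]}})

tanks = {}
for row in json.loads(resp_text)['data']['data']:
    tanks[row[0] + '_' + row[4]] = {
        'tank_nation': row[0],   # e.g. 'china'
        'tank_rank': row[3],     # tier
        'tank_name': row[4],
        'tank_url': row[6],      # slug later used to build the detail-page url
        'tank_id': row[7],
    }

print(tanks['china_BZ-176']['tank_url'])  # Ch47_BZ_176
```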

  The POST request parameters are as follows:


  Special note: when constructing the header for this request, the Content-Length parameter is required.
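Note that requests normally computes Content-Length by itself; if you do set it manually, it must match the length of the urlencoded form body exactly, or the server may reject the request. A small sketch of computing it instead of hard-coding a value like '135':

```python
from urllib.parse import urlencode

form_data = {
    'filter[nation]': '',
    'filter[type]': 'lightTank',
    'filter[role]': '',
    'filter[tier]': '',
    'filter[language]': 'zh-cn',
    'filter[premium]': '0,1',
}

# urlencode() produces the same body requests sends for data=form_data,
# so its length is the correct Content-Length value.
body = urlencode(form_data)
headers = {
    'Content-Length': str(len(body)),
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8',
}
```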

  Code:

# -*- coding: utf-8 -*-
# Author  : xiayouran
# Email   : [email protected]
# Datetime: 2023/9/29 22:43
# Filename: spider_wot.py
import os
import time
import json
import requests


class WOTSpider:
    def __init__(self):
        self.base_headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/117.0.0.0 Safari/537.36',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
        }
        self.post_headers = {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Content-Length': '135',  # must match the urlencoded body length
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'
        }
        self.from_data = {
            'filter[nation]': '',
            'filter[type]': 'lightTank',
            'filter[role]': '',
            'filter[tier]': '',
            'filter[language]': 'zh-cn',
            'filter[premium]': '0,1'
        }

        self.tank_list_url = 'https://wotgame.cn/wotpbe/tankopedia/api/vehicles/by_filters/'
        self.tank_label = ['lightTank', 'mediumTank', 'heavyTank', 'AT-SPG', 'SPG']
        self.tanks = {}

    def parser_tanklist_html(self, html_text):
        json_data = json.loads(html_text)
        for data in json_data['data']['data']:
            self.tanks[data[0] + '_' + data[4]] = {
                'tank_nation': data[0],
                'tank_type': data[1],
                'tank_rank': data[3],
                'tank_name': data[4],
                'tank_name_s': data[5],
                'tank_url': data[6],
                'tank_id': data[7]
            }
            
    def run(self):
        for label in self.tank_label:
            self.from_data['filter[type]'] = label
            html_text = self.get_html(self.tank_list_url, method='POST', from_data=self.from_data)
            if not html_text:
                print('[{}] error'.format(label))
                continue
            self.parser_tanklist_html(html_text)
            time.sleep(3)
        self.save_json(os.path.join(self.data_path, 'tank_list.json'), self.tanks)


if __name__ == '__main__':
    tank_spider = WOTSpider()
    tank_spider.run()

  The code above only shows the important functions and variable declarations. The complete code can be pulled from GitHub: WOT
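For reference, the partial class above also relies on a get_html() fetcher, a save_json() writer, and a self.data_path output directory that are only defined in the repository. A minimal sketch of what those helpers could look like (an assumption on my part; the repository's versions may differ):

```python
import json
import os

import requests


def get_html(url, method='GET', headers=None, from_data=None, timeout=10):
    """Fetch a page and return its text, or '' on any request failure."""
    try:
        if method.upper() == 'POST':
            resp = requests.post(url, headers=headers, data=from_data, timeout=timeout)
        else:
            resp = requests.get(url, headers=headers, timeout=timeout)
        resp.raise_for_status()
        return resp.text
    except requests.RequestException:
        return ''


def save_json(file_path, data):
    """Write a dict to disk as pretty-printed UTF-8 JSON."""
    os.makedirs(os.path.dirname(file_path) or '.', exist_ok=True)
    with open(file_path, 'w', encoding='utf-8') as f:
        json.dump(data, f, ensure_ascii=False, indent=4)
```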

3. Obtain detailed information about tanks

  The page with a specific tank's information is a plain HTML page that can be obtained with a single GET request. I won't go into detail on how to analyze it; students interested in crawler techniques can look that up. Here I will only describe the crawling process.
  First analyze the GET request https://wotgame.cn/zh-cn/tankopedia/60209-Ch47_BZ_176/, whose url can be divided into three parts:
  Part 1: the base url: https://wotgame.cn/zh-cn/tankopedia;
  Part 2: the tank's id: 60209 is BZ-176's id, each tank's id is unique, and this parameter can be obtained from the POST request in the previous step;
  Part 3: the tank's name: Ch47_BZ_176, which can also be obtained from the POST request in the previous step.
  This way a corresponding url can be constructed for each tank, and we just parse the corresponding page. When analyzing, I split the page into two parts. The basic information of the tank, such as tank type, tier, and price, is static and was parsed with the BeautifulSoup library. The specific information of the tank, such as firepower, mobility, protection, and spotting ability, is fetched dynamically by JavaScript. For the sake of simplicity, instead of analyzing the js requests themselves, the selenium library is used first to render the page, and then BeautifulSoup is used to parse it. I won't go into further detail here, but the page-parsing code is given below:
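Concretely, the three url parts can be joined per tank like this (a small sketch; tank_id and tank_url are the fields saved when crawling the tank list):

```python
BASE_URL = 'https://wotgame.cn/zh-cn/tankopedia/'


def build_tank_url(tank_id, tank_url):
    """Assemble the detail-page url from the base url, tank id, and tank name."""
    return '{}{}-{}/'.format(BASE_URL, tank_id, tank_url)


print(build_tank_url(60209, 'Ch47_BZ_176'))
# → https://wotgame.cn/zh-cn/tankopedia/60209-Ch47_BZ_176/
```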

# -*- coding: utf-8 -*-
# Author  : xiayouran
# Email   : [email protected]
# Datetime: 2023/9/29 22:43
# Filename: spider_wot.py
import copy
import glob
import os
import time

import requests
from tqdm import tqdm
from bs4 import BeautifulSoup, Tag
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait

class WOTSpider:
    def __init__(self):
        pass
        
    def is_span_with_value(self, driver):
        """Wait condition: True once the maxHealth span has rendered text."""
        try:
            element = driver.find_element(By.XPATH, "//span[@data-bind=\"text: ttc().getFormattedBestParam('maxHealth', 'gt')\"]")
            return bool(element.text.strip())
        except Exception:
            return False

    def get_html_driver(self, url):
        self.driver.get(url)
        self.wait.until(self.is_span_with_value)
        page_source = self.driver.page_source

        return page_source

    def parser_tankinfo_html(self, html_text):
        tank_info = copy.deepcopy(self.tank_info)
        soup = BeautifulSoup(html_text, 'lxml')
        # tank_name = soup.find(name='h1', attrs={'class': 'garage_title garage_title__inline js-tank-title'}).strip()
        tank_statistic = soup.find_all(name='div', attrs={'class': 'tank-statistic_item'})
        for ts in tank_statistic:
            ts_text = [t for t in ts.get_text().split('\n') if t]
            if len(ts_text) == 5:
                tank_info['价格'] = {  # price
                    '银币': ts_text[-3],  # silver (credits)
                    '经验': ts_text[-1]  # experience
                }
            else:
                tank_info[ts_text[0]] = ts_text[-1]

        tank_property1 = soup.find(name='p', attrs='garage_objection')
        tank_property2 = soup.find(name='p', attrs='garage_objection garage_objection__collector')
        if tank_property1:
            tank_info['性质'] = tank_property1.text  # tank nature (premium/collector)
        elif tank_property2:
            tank_info['性质'] = tank_property2.text
        else:
            tank_info['性质'] = '银币坦克'  # default: silver (credit) tank

        tank_desc_tag = soup.find(name='p', attrs='tank-description_notification')
        if tank_desc_tag:
            tank_info['历史背景'] = tank_desc_tag.text  # historical background

        tank_parameter = soup.find_all(name='div', attrs={'class': 'specification_block'})
        for tp_tag in tank_parameter:
            param_text = tp_tag.find_next(name='h2', attrs={'class': 'specification_title specification_title__sub'}).get_text()
            # spec_param = tp_tag.find_all_next(name='div', attrs={'class': 'specification_item'})
            spec_param = [tag for tag in tp_tag.contents if isinstance(tag, Tag) and tag.get('class') == ['specification_item']]
            spec_info = {}
            for tp in spec_param:
                tp_text = [t for t in tp.get_text().replace(' ', '').split('\n') if t]
                if not tp_text or not tp_text[0][0].isdigit():
                    continue
                spec_info[tp_text[-1]] = ' '.join(tp_text[:-1])
            tank_info[param_text] = spec_info

        return tank_info

    def run(self):
        file_list = [os.path.basename(file)[:-5] for file in glob.glob(os.path.join(self.data_path, '*.json'))]

        for k, item in tqdm(self.tanks.items(), desc='Crawling'):
            file_name = k.replace('"', '').replace('“', '').replace('”', '').replace('/', '-').replace('\\', '').replace('*', '+')
            if file_name in file_list:
                continue
            tank_url = self.tank_url + str(item['tank_id']) + '-' + item['tank_url']
            html_text = self.get_html_driver(tank_url)
            # html_text = self.get_html(tank_url, method='GET')
            tank_info = self.parser_tankinfo_html(html_text)
            self.tanks[k].update(tank_info)
            self.save_json(os.path.join(self.data_path, '{}.json'.format(file_name)), self.tanks[k])
            time.sleep(1.5)
        self.save_json(os.path.join(self.data_path, 'tank_list_detail.json'), self.tanks)


if __name__ == '__main__':
    tank_spider = WOTSpider()
    tank_spider.run()

  All tank information can be crawled in about half an hour; the result looks like this:


  The Selenium library depends on chromedriver, and you need to download the version matching your Chrome browser. The official chromedriver download guide is: https://chromedriver.chromium.org/downloads/version-selection

Conclusion

  The complete code and crawling results of this article have been pushed to the repository; pull it if you are interested. The next article will construct a knowledge graph of the tank encyclopedia from the tank information obtained here.

Open source code repository


  If you like it, remember to give my GitHub repository WOT a star! ヾ(≧∇≦*)ヾ


  My official account 夏小悠 is now open; follow it for more articles about Python, the latest technologies in the AI field, papers on LLM large models, and internal PPT materials ^_^


Origin blog.csdn.net/qq_42730750/article/details/133562570