About World of Tanks
"World of Tanks" (World of Tanks, WOT)
is an online war game that I played during my undergraduate studies and Wargaming
was developed by a company. It was first launched in Russia on October 30, 2010, launched in North America and Europe on April 12, 2011, and launched in China on March 15, 2011 through Kongkong.com (in 2020, the national server is represented by 360). The background of the game is set during World War II. Players will play tanks from the 1930s to the 1960s to fight, requiring strategy and cooperation. The tanks in the game are restored to a high degree of history.
World of Tanks official website: https://wotgame.cn/
World of Tanks tank encyclopedia: https://wotgame.cn/zh-cn/tankopedia/#wot&w_m=tanks
1. Crawler task
WOT currently has five tank types and 11 nation series. We plan to build a knowledge graph of the tank encyclopedia, so we first use a crawler to obtain detailed information about every tank: tier, firepower, mobility, protection, reconnaissance ability, and so on. Taking the current tier-eight powerhouse, the Chinese heavy tank BZ-176, as an example, its details are as follows:
2. Get tank list
Routine operation first: press F12 to open the developer tools, refresh with F5, inspect the page traffic, and locate the specific request that fetches the tank list:
It is a POST request that returns JSON data containing some basic information about each tank of the selected type:
The POST request parameters are as follows:
Special note: when constructing the headers for this request, the Content-Length parameter is required.
Code:
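Incidentally, the Content-Length value does not have to be computed by hand: requests derives it from the urlencoded form body when a request is prepared. A standalone sketch (using the same url and filter parameters as the spider; nothing is sent over the network, since prepare() works offline):

```python
import requests

# The same filter form the spider sends
form_data = {
    'filter[nation]': '',
    'filter[type]': 'lightTank',
    'filter[role]': '',
    'filter[tier]': '',
    'filter[language]': 'zh-cn',
    'filter[premium]': '0,1',
}
req = requests.Request(
    'POST',
    'https://wotgame.cn/wotpbe/tankopedia/api/vehicles/by_filters/',
    data=form_data,
)
prepared = req.prepare()  # urlencodes the body and sets Content-Length
print(prepared.headers['Content-Length'])  # 135 for exactly this form data
```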
Code:
# -*- coding: utf-8 -*-
# Author  : xiayouran
# Email   : [email protected]
# Datetime: 2023/9/29 22:43
# Filename: spider_wot.py
import os
import time
import json

import requests


class WOTSpider:
    def __init__(self):
        self.base_headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                          'Chrome/117.0.0.0 Safari/537.36',
            'Accept-Encoding': 'gzip, deflate, br',
            'Accept-Language': 'zh-CN,zh;q=0.9',
        }
        self.post_headers = {
            'Accept': 'application/json, text/javascript, */*; q=0.01',
            'Content-Length': '135',
            'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'
        }
        self.from_data = {
            'filter[nation]': '',
            'filter[type]': 'lightTank',
            'filter[role]': '',
            'filter[tier]': '',
            'filter[language]': 'zh-cn',
            'filter[premium]': '0,1'
        }
        self.tank_list_url = 'https://wotgame.cn/wotpbe/tankopedia/api/vehicles/by_filters/'
        self.tank_label = ['lightTank', 'mediumTank', 'heavyTank', 'AT-SPG', 'SPG']
        self.data_path = 'data'
        self.tanks = {}

    def get_html(self, url, method='GET', from_data=None):
        # Minimal request helper; the full version lives in the GitHub repo
        headers = dict(self.base_headers)
        try:
            if method == 'POST':
                headers.update(self.post_headers)
                resp = requests.post(url, headers=headers, data=from_data, timeout=10)
            else:
                resp = requests.get(url, headers=headers, timeout=10)
            resp.raise_for_status()
            return resp.text
        except requests.RequestException:
            return ''

    def save_json(self, filename, data):
        os.makedirs(os.path.dirname(filename), exist_ok=True)
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=4)

    def parser_tanklist_html(self, html_text):
        # Each row of the JSON payload is a flat list; pick out the fields we need
        json_data = json.loads(html_text)
        for data in json_data['data']['data']:
            self.tanks[data[0] + '_' + data[4]] = {
                'tank_nation': data[0],
                'tank_type': data[1],
                'tank_rank': data[3],
                'tank_name': data[4],
                'tank_name_s': data[5],
                'tank_url': data[6],
                'tank_id': data[7]
            }

    def run(self):
        for label in self.tank_label:
            self.from_data['filter[type]'] = label
            html_text = self.get_html(self.tank_list_url, method='POST', from_data=self.from_data)
            if not html_text:
                print('[{}] error'.format(label))
                continue
            self.parser_tanklist_html(html_text)
            time.sleep(3)
        self.save_json(os.path.join(self.data_path, 'tank_list.json'), self.tanks)


if __name__ == '__main__':
    tank_spider = WOTSpider()
    tank_spider.run()
The above code only implements the key functions and variable declarations. The complete code can be pulled from GitHub: WOT
3. Obtain detailed information about tanks
The page with a specific tank's information is a plain HTML page that can be fetched with a single GET request. I won't go into detail on how to analyze it; students interested in crawler techniques can look it up themselves. Here I will only describe the crawling process.
First analyze the GET request: https://wotgame.cn/zh-cn/tankopedia/60209-Ch47_BZ_176/, whose url can be divided into three parts:
Part 1: the base url: https://wotgame.cn/zh-cn/tankopedia;
Part 2: the tank's id: 60209. Each tank's id is unique, and it can be obtained from the POST request in the previous step;
Part 3: the tank's name: Ch47_BZ_176. This can also be obtained from the POST request in the previous step.
This way you can construct a corresponding url for each tank and simply parse the corresponding page. When analyzing, I split the work into two parts. The tank's basic information, such as its type, tier and price, sits in the static HTML and is parsed with the BeautifulSoup library. The tank's specific information, such as firepower, mobility, protection and reconnaissance ability, is filled in dynamically by the page's JavaScript. For simplicity, the js code is not analyzed here; instead the selenium library is used to render the page first, and then BeautifulSoup is used for parsing. I won't go into the details; the page-parsing code is given below:
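The three url parts can be assembled with a small helper (build_tank_url is an illustrative name, not part of the original code; the spider itself concatenates the pieces inline):

```python
BASE_URL = 'https://wotgame.cn/zh-cn/tankopedia'

def build_tank_url(tank_id, tank_name, base=BASE_URL):
    # Part 1 (base url) + Part 2 (id) + Part 3 (name), joined as "<base>/<id>-<name>/"
    return '{}/{}-{}/'.format(base, tank_id, tank_name)

print(build_tank_url(60209, 'Ch47_BZ_176'))
# https://wotgame.cn/zh-cn/tankopedia/60209-Ch47_BZ_176/
```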
# -*- coding: utf-8 -*-
# Author  : xiayouran
# Email   : [email protected]
# Datetime: 2023/9/29 22:43
# Filename: spider_wot.py
import os
import copy
import glob
import json
import time

import requests
from tqdm import tqdm
from bs4 import BeautifulSoup, Tag
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait


class WOTSpider:
    def __init__(self):
        # Only the fields used below are shown; see the GitHub repo for the full class
        self.data_path = 'data'
        self.tank_url = 'https://wotgame.cn/zh-cn/tankopedia/'
        self.tank_info = {}
        self.tanks = {}  # filled from tank_list.json obtained in the previous step
        self.driver = webdriver.Chrome()
        self.wait = WebDriverWait(self.driver, timeout=30)

    def is_span_with_value(self, driver):
        # Wait condition: the page's JS has filled in the maxHealth <span>
        try:
            element = driver.find_element(
                By.XPATH,
                "//span[@data-bind=\"text: ttc().getFormattedBestParam('maxHealth', 'gt')\"]")
            return bool(element.text.strip())
        except Exception:
            return False

    def get_html_driver(self, url):
        # Render the page with selenium and wait until the dynamic values appear
        self.driver.get(url)
        self.wait.until(self.is_span_with_value)
        page_source = self.driver.page_source
        return page_source

    def save_json(self, filename, data):
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(data, f, ensure_ascii=False, indent=4)

    def parser_tankinfo_html(self, html_text):
        tank_info = copy.deepcopy(self.tank_info)
        soup = BeautifulSoup(html_text, 'lxml')
        # Basic statistics: type, tier, price, ...
        tank_statistic = soup.find_all(name='div', attrs={'class': 'tank-statistic_item'})
        for ts in tank_statistic:
            ts_text = [t for t in ts.get_text().split('\n') if t]
            if len(ts_text) == 5:  # the price item holds both silver and experience
                tank_info['价格'] = {
                    '银币': ts_text[-3],
                    '经验': ts_text[-1]
                }
            else:
                tank_info[ts_text[0]] = ts_text[-1]
        # Tank nature: premium / collector / plain silver tank
        tank_property1 = soup.find(name='p', attrs='garage_objection')
        tank_property2 = soup.find(name='p', attrs='garage_objection garage_objection__collector')
        if tank_property1:
            tank_info['性质'] = tank_property1.text
        elif tank_property2:
            tank_info['性质'] = tank_property2.text
        else:
            tank_info['性质'] = '银币坦克'
        tank_desc_tag = soup.find(name='p', attrs='tank-description_notification')
        if tank_desc_tag:
            tank_info['历史背景'] = tank_desc_tag.text
        # Detailed specifications: firepower, mobility, protection, reconnaissance
        tank_parameter = soup.find_all(name='div', attrs={'class': 'specification_block'})
        for tp_tag in tank_parameter:
            param_text = tp_tag.find_next(
                name='h2', attrs={'class': 'specification_title specification_title__sub'}).get_text()
            spec_param = [tag for tag in tp_tag.contents
                          if isinstance(tag, Tag) and tag.get('class') == ['specification_item']]
            spec_info = {}
            for tp in spec_param:
                tp_text = [t for t in tp.get_text().replace(' ', '').split('\n') if t]
                if not tp_text or not tp_text[0][0].isdigit():
                    continue
                spec_info[tp_text[-1]] = ' '.join(tp_text[:-1])
            tank_info[param_text] = spec_info
        return tank_info

    def run(self):
        # Skip tanks that were already crawled and saved as individual JSON files
        file_list = [os.path.basename(file)[:-5] for file in glob.glob(os.path.join(self.data_path, '*.json'))]
        for k, item in tqdm(self.tanks.items(), desc='Crawling'):
            file_name = k.replace('"', '').replace('“', '').replace('”', '').replace('/', '-').replace('\\', '').replace('*', '+')
            if file_name in file_list:
                continue
            tank_url = self.tank_url + str(item['tank_id']) + '-' + item['tank_url']
            html_text = self.get_html_driver(tank_url)
            tank_info = self.parser_tankinfo_html(html_text)
            self.tanks[k].update(tank_info)
            self.save_json(os.path.join(self.data_path, '{}.json'.format(file_name)), self.tanks[k])
            time.sleep(1.5)
        self.save_json(os.path.join(self.data_path, 'tank_list_detail.json'), self.tanks)


if __name__ == '__main__':
    tank_spider = WOTSpider()
    tank_spider.run()
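The statistics-parsing loop in parser_tankinfo_html can be exercised offline on a minimal, hypothetical HTML fragment (the markup below only mimics the real page's tank-statistic_item structure; the labels and numbers are made up for illustration, and the stdlib html.parser is used instead of lxml to avoid an extra dependency):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment mimicking the real page's markup
html = '''
<div class="tank-statistic_item">
等级
X
</div>
<div class="tank-statistic_item">
价格
银币
3,650,000
经验
265,000
</div>
'''
soup = BeautifulSoup(html, 'html.parser')  # 'lxml' works too if installed
info = {}
for ts in soup.find_all('div', attrs={'class': 'tank-statistic_item'}):
    ts_text = [t.strip() for t in ts.get_text().split('\n') if t.strip()]
    if len(ts_text) == 5:  # the price item carries both silver and experience
        info[ts_text[0]] = {ts_text[1]: ts_text[2], ts_text[3]: ts_text[4]}
    else:                  # plain "label / value" items
        info[ts_text[0]] = ts_text[-1]
print(info)
```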
All tank information can be obtained in about half an hour, as follows:
A note on dependencies: Selenium relies on chromedriver, and you need to download the version that matches your Chrome browser. The official chromedriver download address is: https://chromedriver.chromium.org/downloads/version-selection
Conclusion
The complete code and crawling results of this article have been pushed to the repository; if you are interested, feel free to pull them. The next article will construct a knowledge graph of the tank encyclopedia based on the tank information obtained here.
Open source code repository
If you like it, remember to give my GitHub repository WOT a star! ヾ(≧∇≦*)ヾ
My public account 夏小悠 is now open; follow it for more Python articles, the latest technologies in the AI field, papers on LLMs (large language models), and internal PPT materials ^_^