First, String Discretization
The figure shows the data we need to analyze. It has a tags column listing each hero's attributes; each attribute can apply to many heroes. The tags are stored as a list (serialized as a string), so our first step is to split them apart, turning the data into wide format.
- First, extract all distinct categories from the tags. Then create an all-zero array with the same number of rows as the original data and one column per category. Finally, traverse the original data's tags and change the 0 at each corresponding position to 1.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(r"lol_hero.csv", index_col=0)
temp_list = [eval(x) for x in df["tags"].tolist()] # parse the tags column (each cell is a stringified list)
tag_all = set([j for i in temp_list for j in i]) # flatten and collect all distinct tags
print(tag_all)
# Output:
{'刺客','坦克','射手','战士',
'打野','控制','法师','法术',
'爆发','物理','突进','辅助',
'输出','近战','远程','魔法'}
Create a zero-filled DataFrame:
zeros_data = pd.DataFrame(np.zeros((df.shape[0], len(tag_all))), columns=list(tag_all))
Traverse the original data and set the matching positions to 1:
for i in range(df.shape[0]):
zeros_data.loc[i, temp_list[i]] = 1
Concatenate the two DataFrames and drop the original tags column:
data = pd.concat([df, zeros_data], axis=1).drop(labels="tags", axis=1)
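The collect-tags / zero-matrix / assignment loop above can also be collapsed into a single pandas call: join each tag list into a delimited string, then expand it with `Series.str.get_dummies`. A minimal sketch on a toy tags column (the tag names here are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the parsed tags column: each row is a list of tags
tags = pd.Series([["Fighter", "Tank"], ["Mage"], ["Mage", "Support"]])

# Join each list with a separator, then expand into one 0/1 column per tag
onehot = tags.str.join("|").str.get_dummies(sep="|")
print(onehot)
```

This produces the same 0/1 wide columns as the manual loop, with columns sorted alphabetically.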
This gives us wide data, as shown below:
- Alternatively, the same data could be reshaped into long format (long data), as shown below:
- Comparison: long data usually does not record missing values, while wide data must record them to keep the rows consistent. Long data is easy to maintain: adding or removing an observation only touches a single row, whereas the same change in wide data requires operating on columns. For example, if a new hero attribute appears, wide data needs a whole new column, which is relatively hard to do in a typical database, so wide data is rarely used for database storage. From a storage perspective, long data repeats a lot of "invalid" (redundant) values; this difference only becomes noticeable at larger scales, where wide data performs better and occupies less space on disk.
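A long-format version of the tags data can be produced with `DataFrame.explode`, which emits one row per (hero, tag) pair. A sketch with made-up hero names:

```python
import pandas as pd

df = pd.DataFrame({
    "hero": ["Annie", "Garen"],
    "tags": [["Mage", "Burst"], ["Fighter", "Tank"]],
})

# One row per (hero, tag) pair: the long representation of the same data
long_df = df.explode("tags").rename(columns={"tags": "tag"})
print(long_df)
```

Note how the hero name is repeated on every row of its group, which is exactly the redundancy discussed above.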
Second, Preliminary Data Analysis
- With the discretized string data in hand, we can count how many heroes correspond to each attribute.
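Since each attribute is now a 0/1 column, the per-attribute hero count is just a column sum. A sketch using a toy one-hot frame standing in for the tag columns of `data` (column names made up for illustration):

```python
import pandas as pd

# Toy stand-in for the one-hot tag columns built in the previous step
data = pd.DataFrame({
    "Tank": [1, 0, 1],
    "Mage": [0, 1, 1],
    "Support": [0, 0, 1],
})

# Summing each 0/1 column counts the heroes carrying that attribute
counts = data.sum().sort_values(ascending=False)
print(counts)
```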
- Group the heroes by operation difficulty and defense rating to see what information can be obtained.
Bar chart: the horizontal axis shows operation difficulty levels, and the bar colors encode defense levels.
We can see that the heroes with operation difficulty 10 share a common trait: their defense rating is 3, i.e. they are squishy; high-defense heroes are mostly distributed between difficulty 4 and 6.
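A chart like the one described can be built from a crosstab of difficulty against defense, plotted as stacked bars. A sketch assuming the columns are named `difficulty` and `defense` (the toy values below are made up, not the real hero data):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the hero table
df = pd.DataFrame({
    "difficulty": [10, 10, 4, 5, 6, 6],
    "defense":    [3, 3, 8, 7, 9, 2],
})

# Rows: difficulty levels; columns: defense levels; values: hero counts
table = pd.crosstab(df["difficulty"], df["defense"])
ax = table.plot(kind="bar", stacked=True)
ax.set_xlabel("operation difficulty")
ax.set_ylabel("number of heroes")
plt.tight_layout()
plt.savefig("difficulty_vs_defense.png")
```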
We try k-means clustering to see whether the result is satisfactory.
import pandas as pd
from sklearn.cluster import KMeans
df = pd.read_csv(r"lol_hero.csv", index_col=0)
df1 = df[["life", "physical", "magic", "difficulty"]]
kmodel = KMeans(n_clusters=10, n_init=10, random_state=0)  # n_jobs was removed in scikit-learn 1.0
kmodel.fit(df1)
temp_label = kmodel.labels_.tolist()
df["type"] = temp_label
print(df[df["type"] == 1])
Printing one of the clusters at random, we find that the heroes in this class all have relatively high AP (magic) scores.
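Rather than printing one class at a time, the clusters can be characterized by comparing per-cluster feature means. A sketch on toy data with the same four feature columns (2 clusters for brevity; the values are made up):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for df1: four numeric hero attributes
df1 = pd.DataFrame({
    "life":       [9, 8, 2, 3],
    "physical":   [8, 9, 1, 2],
    "magic":      [1, 2, 9, 8],
    "difficulty": [3, 4, 7, 6],
})

kmodel = KMeans(n_clusters=2, n_init=10, random_state=0)
df1 = df1.assign(type=kmodel.fit_predict(df1))

# Mean of each feature per cluster: the high-`magic` cluster is the AP-like class
profile = df1.groupby("type").mean()
print(profile)
```

The row of `profile` with the highest `magic` mean corresponds to the AP-heavy class observed above.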