First, String Discretization
The figure shows the data we need to analyze. It has a tags column listing each hero's attributes; each attribute can apply to many heroes. The tags are stored as a list (serialized as a string), so our first step is to split them apart, turning the data into wide format.
- First, extract all distinct categories from the tags. Then create an all-zero array with the same number of rows as the original data and one column per category. Finally, traverse the original data's tags and change the 0 at each corresponding position to 1.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
df = pd.read_csv(r"lol_hero.csv", index_col=0)
temp_list = [eval(x) for x in df["tags"].tolist()] # parse the tags column (each cell is a stringified list)
tag_all = set([j for i in temp_list for j in i]) # flatten and collect all distinct tags
print(tag_all)
# Output:
{'刺客','坦克','射手','战士',
'打野','控制','法师','法术',
'爆发','物理','突进','辅助',
'输出','近战','远程','魔法'}
Create a zero-filled DataFrame:
zeros_data = pd.DataFrame(np.zeros((df.shape[0], len(tag_all))), columns=list(tag_all))
Traverse the original data and set the matching positions to 1:
for i in range(df.shape[0]):
zeros_data.loc[i, temp_list[i]] = 1
Concatenate the two DataFrames and drop the original tags column:
data = pd.concat([df, zeros_data], axis=1).drop(labels="tags", axis=1)
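The collect-tags / zero-matrix / assignment loop above can also be collapsed into a single pandas call: join each tag list into a delimited string, then expand it with `Series.str.get_dummies`. A minimal sketch on a toy tags column (the tag names here are made up for illustration):

```python
import pandas as pd

# Toy stand-in for the parsed tags column: each row is a list of tags
tags = pd.Series([["Fighter", "Tank"], ["Mage"], ["Mage", "Support"]])

# Join each list with a separator, then expand into one 0/1 column per tag
onehot = tags.str.join("|").str.get_dummies(sep="|")
print(onehot)
```

This produces the same 0/1 wide columns as the manual loop, with columns sorted alphabetically.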
This gives us wide data, as shown below:
- Alternatively, the same data could be reshaped into long format (long data), as shown below:
- Comparison: long data usually does not record missing values, while wide data must record them to keep the rows consistent. Long data is easy to maintain: adding or removing an observation only touches a single row, whereas the same change in wide data requires operating on columns. For example, if a new hero attribute appears, wide data needs a whole new column, which is relatively hard to do in a typical database, so wide data is rarely used for database storage. From a storage perspective, long data repeats a lot of "invalid" (redundant) values; this difference only becomes noticeable at larger scales, where wide data performs better and occupies less space on disk.
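A long-format version of the tags data can be produced with `DataFrame.explode`, which emits one row per (hero, tag) pair. A sketch with made-up hero names:

```python
import pandas as pd

df = pd.DataFrame({
    "hero": ["Annie", "Garen"],
    "tags": [["Mage", "Burst"], ["Fighter", "Tank"]],
})

# One row per (hero, tag) pair: the long representation of the same data
long_df = df.explode("tags").rename(columns={"tags": "tag"})
print(long_df)
```

Note how the hero name is repeated on every row of its group, which is exactly the redundancy discussed above.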
Second, Preliminary Data Analysis
- With the discretized string data in hand, we can count how many heroes correspond to each attribute.
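Since each attribute is now a 0/1 column, the per-attribute hero count is just a column sum. A sketch using a toy one-hot frame standing in for the tag columns of `data` (column names made up for illustration):

```python
import pandas as pd

# Toy stand-in for the one-hot tag columns built in the previous step
data = pd.DataFrame({
    "Tank": [1, 0, 1],
    "Mage": [0, 1, 1],
    "Support": [0, 0, 1],
})

# Summing each 0/1 column counts the heroes carrying that attribute
counts = data.sum().sort_values(ascending=False)
print(counts)
```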
- Group the heroes by operation difficulty and defense rating to see what information can be obtained.
Bar chart: the horizontal axis shows operation difficulty levels, and the bar colors encode defense levels.
We can see that the heroes with operation difficulty 10 share a common trait: their defense rating is 3, i.e. they are squishy; high-defense heroes are mostly distributed between difficulty 4 and 6.
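A chart like the one described can be built from a crosstab of difficulty against defense, plotted as stacked bars. A sketch assuming the columns are named `difficulty` and `defense` (the toy values below are made up, not the real hero data):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

# Toy stand-in for the hero table
df = pd.DataFrame({
    "difficulty": [10, 10, 4, 5, 6, 6],
    "defense":    [3, 3, 8, 7, 9, 2],
})

# Rows: difficulty levels; columns: defense levels; values: hero counts
table = pd.crosstab(df["difficulty"], df["defense"])
ax = table.plot(kind="bar", stacked=True)
ax.set_xlabel("operation difficulty")
ax.set_ylabel("number of heroes")
plt.tight_layout()
plt.savefig("difficulty_vs_defense.png")
```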
We try k-means clustering to see whether the result is satisfactory.
import pandas as pd
from sklearn.cluster import KMeans
df = pd.read_csv(r"lol_hero.csv", index_col=0)
df1 = df[["life", "physical", "magic", "difficulty"]]
kmodel = KMeans(n_clusters=10, n_init=10, random_state=0)  # n_jobs was removed in scikit-learn 1.0
kmodel.fit(df1)
temp_label = kmodel.labels_.tolist()
df["type"] = temp_label
print(df[df["type"] == 1])
Printing one of the clusters at random, we find that the heroes in this class all have relatively high AP (magic) scores.
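Rather than printing one class at a time, the clusters can be characterized by comparing per-cluster feature means. A sketch on toy data with the same four feature columns (2 clusters for brevity; the values are made up):

```python
import pandas as pd
from sklearn.cluster import KMeans

# Toy stand-in for df1: four numeric hero attributes
df1 = pd.DataFrame({
    "life":       [9, 8, 2, 3],
    "physical":   [8, 9, 1, 2],
    "magic":      [1, 2, 9, 8],
    "difficulty": [3, 4, 7, 6],
})

kmodel = KMeans(n_clusters=2, n_init=10, random_state=0)
df1 = df1.assign(type=kmodel.fit_predict(df1))

# Mean of each feature per cluster: the high-`magic` cluster is the AP-like class
profile = df1.groupby("type").mean()
print(profile)
```

The row of `profile` with the highest `magic` mean corresponds to the AP-heavy class observed above.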