A Summary of Decision Trees in Machine Learning

Foreword

A decision tree is a machine learning algorithm based on a tree structure. It is one of the most widely used data mining algorithms and can be applied to both classification and regression problems.

It serves as a predictive model that infers a sample's target value from its observed features. Depending on the type of prediction, decision tree learning falls into two categories:

  • Classification trees, whose predictions are limited to a set of discrete values. Each path through the tree corresponds to a conjunction (logical AND) of feature tests, and the leaf node at the end of the path gives the class label predicted for samples that satisfy those tests.
  • Regression trees, whose predictions are continuous values.

A decision tree can be considered as a collection of if-then rules, or as a conditional probability distribution defined on feature space and class space.

  • The if-then rule is a formal way of describing the judgment process in a decision tree model. Each rule consists of a premise and a conclusion. For example, if you are using a decision tree to predict whether a person will buy a product, a rule might be: "If this person is under the age of 30 and has an income of $50,000 or more, then they will buy this product." The premise of this rule is "this person is under the age of 30 and has an income of $50,000 or more", and the conclusion is "they will buy this product" (a code sketch of this rule follows the list below).
  • The feature space is the space formed by the feature vectors of all samples; in it, each sample can be represented as a vector.
  • The class space is the space composed of all possible classes; in it, each class can be represented as a point or a region.
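To make the if-then view concrete, the toy purchase rule above can be written directly as nested conditionals. This is only an illustrative sketch using the hypothetical age and income thresholds from the example, not a model learned from data.

def will_buy(age, income):
    """Toy decision 'tree' expressed as if-then rules (hypothetical thresholds)."""
    if age < 30:              # internal node: test on the age feature
        if income >= 50000:   # internal node: test on the income feature
            return True       # leaf: predicted class "will buy"
        return False          # leaf: "will not buy"
    return False              # leaf: "will not buy"

print(will_buy(25, 60000))    # True
print(will_buy(35, 80000))    # False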

The goal of the decision tree algorithm is to find a partition of the feature space such that the samples in each cell of the partition belong, as far as possible, to the same class.

1. Introduction

1.1 Principle

The basic principle of the decision tree algorithm is to divide the data set according to a specific rule, so that the divided subsets are as pure as possible, that is, the samples in the same subset belong to the same category. This process can be regarded as a recursive process, selecting an optimal feature for division each time, until all samples belong to the same category or cannot be further divided.

When constructing a decision tree, we need to consider how to select the optimal feature for each split. Commonly used algorithms include ID3 (Iterative Dichotomiser 3), C4.5, and CART (Classification and Regression Trees). ID3 uses information gain, C4.5 uses the information gain ratio, and CART uses Gini impurity for feature selection.

  • Information gain: the reduction in entropy after the data set is split on a feature
  • Gini impurity: roughly, the probability that a sample drawn at random from the data set would be misclassified if it were labelled according to the set's class distribution

1.2 Process

The basic flow of building a decision tree is a recursive process from the root down to the leaves, choosing a partition attribute at each internal node. The key part of the recursion is the stopping conditions (a minimal sketch follows the list below):

  1. The samples at the current node all belong to the same class, so no further split is needed;
  2. The current attribute set is empty, or all samples take identical values on every remaining attribute, so the node cannot be split. In this case all features have been used up (or carry no information), and the node becomes a leaf labelled with the majority class of its samples (in effect, the observed class frequencies at the node serve as the estimate);
  3. The sample set at the current node is empty, so it cannot be split. This happens when no training sample takes the corresponding attribute value; the node becomes a leaf labelled with the majority class of its parent node (in effect, the parent's class distribution serves as a prior).
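The following minimal sketch mirrors this recursion and its stopping conditions. It is illustrative only: the split-feature choice is a placeholder, whereas real algorithms (section 3) pick it by information gain, gain ratio, or Gini impurity.

from collections import Counter

def build_tree(samples, features):
    """Minimal skeleton of the recursion; samples are rows whose last element is the label."""
    labels = [row[-1] for row in samples]
    # stop condition 1: all samples belong to the same class -> leaf with that class
    if len(set(labels)) == 1:
        return labels[0]
    # stop condition 2: no attributes left to split on -> leaf with the majority class
    if not features:
        return Counter(labels).most_common(1)[0][0]
    best = features[0]  # placeholder; real algorithms choose by gain, gain ratio or Gini
    node = {best: {}}
    for value in set(row[best] for row in samples):
        subset = [row for row in samples if row[best] == value]
        # stop condition 3 (empty subset) would label the branch with the parent's majority
        # class; it cannot occur here because the values are taken from the samples themselves
        node[best][value] = build_tree(subset, [f for f in features if f != best])
    return node

print(build_tree([[0, 1, 'yes'], [0, 0, 'no'], [1, 1, 'no']], [0, 1]))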

1.3 Information entropy, information gain and Gini impurity

  • Information entropy is a measure of the uncertainty (impurity) of a sample set: the smaller its value, the higher the purity of the sample set.

    In the decision tree algorithm, we use information entropy to measure the purity of a sample set. Assume the proportion of class-$k$ samples in the sample set $D$ is $p_k$ ($k = 1, 2, \ldots, y$); the information entropy of $D$ is then defined as:

    $Ent(D) = -\sum_{k=1}^{y} p_k \log_2 p_k$

    where $y$ is the number of classes.

  • Information gain, which builds on information entropy, measures the reduction in entropy obtained by splitting the data set on a feature and is commonly used to select the optimal split feature. It is computed as follows:

    $Gain(D, A) = Ent(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} Ent(D^v)$

    where $D$ is the training set at the current node, $A$ is the candidate feature, $V$ is the number of distinct values of $A$, $D^v$ is the subset of $D$ in which $A$ takes its $v$-th value, $Ent(D)$ is the entropy of the current node, and $Ent(D^v)$ is the entropy of the subset $D^v$.

    The higher the information gain, the greater the contribution of the feature to the classification ability, that is, the feature can better distinguish samples of different categories.

  • Gini impurity is another measure of the purity of a data set; it is the probability that two samples drawn at random from the data set belong to different classes.
    $Gini(D) = \sum_{k=1}^{y} p_k (1 - p_k)$

    The lower the Gini impurity, the higher the purity of the data set. It is typically used to evaluate the quality of a node split: a good split produces child nodes with higher purity, i.e. each child contains a larger proportion of samples of a single class (a numeric illustration of all three measures follows).
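As a quick numeric illustration of these measures, the snippet below computes the entropy, the Gini impurity, and the information gain of one candidate split; the label counts are made up for the example.

import math
from collections import Counter

def entropy(labels):
    """Ent(D) = -sum_k p_k * log2(p_k)"""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini(D) = sum_k p_k * (1 - p_k)"""
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

# hypothetical parent node: 6 positive and 4 negative samples
parent = ['+'] * 6 + ['-'] * 4
# one candidate split of those 10 samples into two child nodes
left, right = ['+'] * 5 + ['-'] * 1, ['+'] * 1 + ['-'] * 3

print(round(entropy(parent), 3))  # ~0.971
print(round(gini(parent), 3))     # 0.48
# information gain of the split: Ent(D) minus the weighted child entropies
gain = entropy(parent) - (len(left) / 10) * entropy(left) - (len(right) / 10) * entropy(right)
print(round(gain, 3))             # ~0.256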

2. Build a decision tree

2.1 Feature Selection

Feature selection means choosing one feature from the many features in the training data as the splitting criterion for the current node. Its advantages and disadvantages are as follows:

Advantages:

  • Reduces the complexity of the decision tree, making the model simpler, lowering the risk of overfitting (a model that performs well on the training set but poorly on the test set), and improving generalization.
  • Reduces training time and storage requirements.

Disadvantages:

  • Some important information may be lost, resulting in a decrease in model accuracy.
  • Some noise may be introduced, resulting in a decrease in model accuracy.
  • It may make the data more complex and lead to a decrease in the generalization ability of the model.

Commonly used feature selection methods include information gain, information gain ratio, Gini index, etc.

2.2 Decision tree generation

Decision tree generation refers to the process of generating a decision tree from training data. According to the above feature selection method, commonly used decision tree generation algorithms include ID3, C4.5, CART and so on.

Decision tree generation recursively splits the training data to build a tree structure that can then classify new data. The process can be divided into the following steps:

  1. Feature selection: Select a feature from the features of the training data as the splitting criterion of the current node.
  2. Node splitting: Divide the training data of the current node into several subsets according to the splitting criteria, and each subset corresponds to a child node.
  3. Recursively generate subtrees: Recursively execute steps 1 and 2 for each child node until the stop condition is met.

Stop conditions usually have the following types:

  • The training data of the current node all belong to the same category.
  • The training data for the current node is empty.
  • All features in the training data of the current node are the same and cannot be further divided.

2.3 Pruning

Decision tree pruning is a technique used to reduce the complexity of decision trees. Its purpose is to improve the generalization ability of the model by deleting some unnecessary nodes and subtrees. There are two commonly used decision tree pruning algorithms: pre-pruning and post-pruning.

  • Pre-pruning means that in the process of generating a decision tree, each node is evaluated. If the split of the current node cannot improve the generalization ability of the model, the split is stopped and the current node is marked as a leaf node. The advantage of pre-pruning is that it is simple and fast, but it may lead to underfitting.

  • Post-pruning refers to pruning the decision tree after the decision tree is generated, thereby reducing the complexity of the decision tree. The post-pruning process usually includes the following steps:

    1. Evaluate each non-leaf node and calculate the performance difference of the model before and after pruning on the validation set.
    2. Select the node with the smallest performance difference for pruning, delete the node and its subtree, and mark the node as a leaf node.
    3. Repeat steps 1 and 2 until you can no longer trim.

    The advantage of post-pruning is that it avoids the underfitting risk of pre-pruning, but it is more computationally expensive because the full tree must be grown before pruning (see the sketch below).
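As a concrete, hedged illustration: scikit-learn implements cost-complexity pruning, one form of post-pruning, through the ccp_alpha parameter, while pre-pruning corresponds to constraints such as max_depth or min_samples_leaf. A minimal sketch on the iris data, using a held-out split as a simple validation set:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# compute the effective alphas of the cost-complexity (post-)pruning path
full_tree = DecisionTreeClassifier(random_state=0)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# refit one pruned tree per alpha and keep the one that validates best
best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha)
    pruned.fit(X_train, y_train)
    score = pruned.score(X_val, y_val)
    if score >= best_score:           # prefer the simplest tree among ties
        best_alpha, best_score = alpha, score

print(best_alpha, best_score)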

3. Classical algorithms

3.1 ID3

The core idea of the ID3 algorithm is to use information gain as the feature selection measure, splitting on the feature with the largest information gain. The procedure is:

  1. Calculate the information entropy of the data set;
  2. For each feature, calculate its information gain;
  3. Select the feature with the largest information gain as the partition attribute;
  4. Divide the data set into multiple subsets according to the value of this attribute;
  5. Steps 1-4 are called recursively for each subset until all samples belong to the same class or no further division is possible.

Disadvantages:

  • ID3 has no pruning strategy and is prone to overfitting
  • The information gain criterion is biased toward features with many possible values; for an ID-like feature (e.g. a serial number) the information gain is close to 1
  • It can only handle discretely distributed features
  • Missing values are not handled

The code is as follows:

%matplotlib inline

import math
from collections import Counter,defaultdict
import matplotlib.pyplot as plt
from matplotlib.font_manager import FontProperties 
font_set = FontProperties(fname=r"c:\\windows\\fonts\\simsun.ttc", size=15)  # load the SimSun font (for the Chinese labels below)


class Id3DecideTree:
    
    def __init__(self, data_set, labels_set):
        self.tree = self.create_tree(data_set,labels_set)
        
    def calc_entropy(self, data):
        """Compute the information entropy of a data set (last column is the label)."""
        label_counts = Counter(sample[-1] for sample in data)
        probs = [count / len(data) for count in label_counts.values()]
        return -sum(p * math.log(p, 2) for p in probs)

    def split_data(self, data, axis, value):
        """Return the subset where feature `axis` equals `value`, with that column removed."""
        return [sample[:axis] + sample[axis+1:] for sample in data if sample[axis] == value]


    def choose_best_feature(self, dataSet):
        """Choose the best feature to split the data set on (largest information gain)."""
        numFeatures = len(dataSet[0]) - 1        # the last column is the label
        baseEntropy = self.calc_entropy(dataSet) # entropy of the whole data set
        bestInfoGain = 0.0
        bestFeature = -1
        for i in range(numFeatures):             # iterate over all features
            featList = [example[i] for example in dataSet]  # all values of this feature
            uniqueVals = set(featList)           # the set of unique values
            newEntropy = 0.0
            for value in uniqueVals:
                subDataSet = self.split_data(dataSet, i, value) # split on this value
                prob = len(subDataSet)/float(len(dataSet))
                newEntropy += prob * self.calc_entropy(subDataSet)
            infoGain = baseEntropy - newEntropy  # information gain = reduction in entropy
            if infoGain > bestInfoGain:          # keep the best gain seen so far
                bestInfoGain = infoGain
                bestFeature = i
        return bestFeature



    def majority_count(self, labels):
        """Return the most frequent class label."""
        label_counts = defaultdict(int)
        for label in labels:
            label_counts[label] += 1
        return max(label_counts, key=label_counts.get)


    def create_tree(self, data, labels):
        """Recursively build the decision tree as nested dictionaries."""
        class_list = [sample[-1] for sample in data]
        # all samples belong to the same class
        if class_list.count(class_list[0]) == len(class_list):
            return class_list[0]
        # only the label column remains
        if len(data[0]) == 1:
            return self.majority_count(class_list)
        # choose the best feature to split on
        best_feature_index = self.choose_best_feature(data)
        best_feature_label = labels[best_feature_index]
        tree = {best_feature_label: {}}
        del(labels[best_feature_index])
        feature_values = [sample[best_feature_index] for sample in data]
        unique_values = set(feature_values)
        for value in unique_values:
            sub_labels = labels[:]
            tree[best_feature_label][value] = self.create_tree(self.split_data(data, best_feature_index, value), sub_labels)
        return tree


class DecisionTreePlotter:
    def __init__(self, tree):
        self.tree = tree
        self.decisionNode = dict(boxstyle="sawtooth", fc="0.8")
        self.leafNode = dict(boxstyle="round4", fc="0.8")
        self.arrow_args = dict(arrowstyle="<-")
        self.font_set = font_set
        
    def getNumLeafs(self, node):
        firstStr = list(node.keys())[0]
        secondDict = node[firstStr]
        return sum([self.getNumLeafs(secondDict[key]) if isinstance(secondDict[key], dict) else 1 for key in secondDict.keys()])

    def getTreeDepth(self, node):
        firstStr = list(node.keys())[0]
        secondDict = node[firstStr]
        return max([1 + self.getTreeDepth(secondDict[key]) if isinstance(secondDict[key], dict) else 1 for key in secondDict.keys()])

    def plotNode(self, nodeTxt, centerPt, parentPt, nodeType):
        self.ax1.annotate(nodeTxt, xy=parentPt,  xycoords='axes fraction',
                 xytext=centerPt, textcoords='axes fraction',
                 va="center", ha="center", bbox=nodeType, arrowprops=self.arrow_args, fontproperties=self.font_set )

    def plotMidText(self, cntrPt, parentPt, txtString):
        xMid = (parentPt[0]-cntrPt[0])/2.0 + cntrPt[0]
        yMid = (parentPt[1]-cntrPt[1])/2.0 + cntrPt[1]
        self.ax1.text(xMid, yMid, txtString, va="center", ha="center", rotation=30, fontproperties=self.font_set)

    def plotTree(self):
        self.totalW = float(self.getNumLeafs(self.tree))
        self.totalD = float(self.getTreeDepth(self.tree))
        self.xOff = -0.5/self.totalW
        self.yOff = 1.0
        self.fig = plt.figure(1, facecolor='white')
        self.fig.clf()
        self.axprops = dict(xticks=[], yticks=[])
        self.ax1 = plt.subplot(111, frameon=False, **self.axprops)
        self.plotTreeHelper(self.tree, (0.5,1.0), '')
        plt.show()

    def plotTreeHelper(self, node, parentPt, nodeTxt):
        numLeafs = self.getNumLeafs(node)  
        depth = self.getTreeDepth(node)
        firstStr = list(node.keys())[0]     
        cntrPt = (self.xOff + (1.0 + float(numLeafs))/2.0/self.totalW, self.yOff)
        self.plotMidText(cntrPt, parentPt, nodeTxt)
        self.plotNode(firstStr, cntrPt, parentPt, self.decisionNode)
        secondDict = node[firstStr]
        self.yOff = self.yOff - 1.0/self.totalD
        for key in secondDict.keys():
            if isinstance(secondDict[key], dict):
                self.plotTreeHelper(secondDict[key],cntrPt,str(key))        
            else:   
                self.xOff = self.xOff + 1.0/self.totalW
                self.plotNode(secondDict[key], (self.xOff, self.yOff), cntrPt, self.leafNode)
                self.plotMidText((self.xOff, self.yOff), cntrPt, str(key))
        self.yOff = self.yOff + 1.0/self.totalD


labels_set = ['不浮出水面', '拥有鳍','有头']

data_set = [
     ['是', '是', '是', '是鱼类'],
     ['是', '是', '否', '不是鱼类'],
     ['是', '否', '是', '不是鱼类'],
     ['否', '是', '否', '不是鱼类'],
     ['否', '否', '是', '不是鱼类']
]

dt = Id3DecideTree(data_set, labels_set)

print(dt.tree)

plotter = DecisionTreePlotter(dt.tree)
plotter.plotTree()

dt_id3.png

3.2 C4.5

The C4.5 algorithm uses the information gain ratio to measure feature selection, and selects the feature with the largest information gain ratio for splitting.

The C4.5 algorithm is an improved version of the ID3 algorithm, and its specific process is as follows:

  1. Calculate the information entropy of the data set;
  2. For each feature, calculate its information gain ratio;
  3. Select the feature with the largest information gain ratio as the partition attribute;
  4. Divide the data set into multiple subsets according to the value of this attribute;
  5. Steps 1-4 are called recursively for each subset until all samples belong to the same class or no further division is possible.

The advantages of the C4.5 algorithm over the ID3 algorithm are:

  • Uses the information gain ratio to select the best partition feature, avoiding ID3's bias toward features with many values

  • Handle both continuous and discrete attributes

  • Handle training data with missing attribute values

  • Prune the tree after creation

The disadvantages are:

  • C4.5 builds multiway trees, whereas binary trees are generally more efficient
  • C4.5 can only be used for classification
  • The entropy-based criterion involves many costly logarithm operations, and continuous attributes additionally require sorting
  • While constructing the tree, C4.5 must sort numeric attribute values and select split points from them, so it only works on data sets that fit in memory; when the training set is too large for memory, the algorithm cannot run

The gain ratio calculation can be implemented as follows (this standalone version re-defines calc_entropy from the ID3 code so the snippet runs on its own):

import math
from collections import Counter

def calc_entropy(data):
    """Information entropy of a data set (last column is the label)."""
    label_counts = Counter(sample[-1] for sample in data)
    probs = [count / len(data) for count in label_counts.values()]
    return -sum(p * math.log(p, 2) for p in probs)

def calc_info_gain_ratio(data, feature_index):
    """Compute the information gain ratio of a feature."""
    base_entropy = calc_entropy(data)
    feature_values = [sample[feature_index] for sample in data]
    unique_values = set(feature_values)
    new_entropy = 0.0
    split_info = 0.0
    for value in unique_values:
        sub_data = [sample for sample in data if sample[feature_index] == value]
        prob = len(sub_data) / float(len(data))
        new_entropy += prob * calc_entropy(sub_data)
        split_info -= prob * math.log(prob, 2)
    info_gain = base_entropy - new_entropy
    if split_info == 0:
        return 0
    return info_gain / split_info

def choose_best_feature(data):
    """Choose the feature with the largest information gain ratio."""
    num_features = len(data[0]) - 1
    best_info_gain_ratio = 0.0
    best_feature_index = -1
    for i in range(num_features):
        info_gain_ratio = calc_info_gain_ratio(data, i)
        if info_gain_ratio > best_info_gain_ratio:
            best_info_gain_ratio = info_gain_ratio
            best_feature_index = i
    return best_feature_index
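The advantages listed above mention continuous attributes; C4.5 handles them by sorting the values and evaluating candidate thresholds at the midpoints between adjacent distinct values. A minimal sketch of that idea, reusing the calc_entropy helper defined above (binary split: value <= t versus value > t):

def best_threshold(data, feature_index):
    """Pick the threshold on a numeric feature that maximizes information gain."""
    base_entropy = calc_entropy(data)
    values = sorted(set(sample[feature_index] for sample in data))
    best_gain, best_t = 0.0, None
    # candidate thresholds: midpoints between adjacent distinct values
    for lo, hi in zip(values, values[1:]):
        t = (lo + hi) / 2.0
        left = [s for s in data if s[feature_index] <= t]
        right = [s for s in data if s[feature_index] > t]
        new_entropy = (len(left) / len(data)) * calc_entropy(left) \
                      + (len(right) / len(data)) * calc_entropy(right)
        gain = base_entropy - new_entropy
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

# hypothetical usage: each row is [feature_value, label]
sample_data = [[1.2, 'no'], [1.8, 'no'], [2.5, 'yes'], [3.1, 'yes'], [3.4, 'yes']]
print(best_threshold(sample_data, 0))  # threshold around 2.15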

3.3 CART

CART uses Gini impurity as its feature selection measure and splits on the feature (and value) with the smallest Gini impurity. It is a binary recursive partitioning technique: each split divides the current samples into exactly two subsets, so every non-leaf node has two branches and the resulting tree is a binary tree with a compact structure.

The CART algorithm builds a binary decision tree; its procedure is as follows:

  1. Choose a feature and a threshold to divide the dataset into two subsets;
  2. Step 1 is called recursively for each subset until all samples belong to the same class or no further division is possible.

The improvement of CART over ID3 and C4.5 is that it uses the Gini index to select the best partition feature. An example implementation:

import numpy as np

class CARTDecisionTree:
    def __init__(self):
        self.tree = {}

    def calc_gini(self, data):
        """Compute the Gini index of a data set (last column is the label)."""
        label_counts = {}
        for sample in data:
            label = sample[-1]
            if label not in label_counts:
                label_counts[label] = 0
            label_counts[label] += 1
        gini = 1.0
        for count in label_counts.values():
            prob = float(count) / len(data)
            gini -= prob ** 2
        return gini

    def split_data(self, data, feature_index, value):
        """Return the subset where the feature equals `value`, with that column removed."""
        new_data = []
        for sample in data:
            if sample[feature_index] == value:
                new_sample = sample[:feature_index]
                new_sample.extend(sample[feature_index+1:])
                new_data.append(new_sample)
        return new_data

    def choose_best_feature(self, data):
        """Choose the best feature and split value by the (binary) Gini index."""
        num_features = len(data[0]) - 1
        best_gini_index = np.inf
        best_feature_index = -1
        best_split_value = None
        for i in range(num_features):
            feature_values = [sample[i] for sample in data]
            unique_values = set(feature_values)
            for value in unique_values:
                sub_data = self.split_data(data, i, value)
                prob = len(sub_data) / float(len(data))
                gini_index = prob * self.calc_gini(sub_data)
                gini_index += (1 - prob) * self.calc_gini([sample for sample in data if sample[i] != value])
                if gini_index < best_gini_index:
                    best_gini_index = gini_index
                    best_feature_index = i
                    best_split_value = value
        return best_feature_index, best_split_value

    def majority_count(self, labels):
        """Return the most frequent class label."""
        label_counts = {}
        for label in labels:
            if label not in label_counts:
                label_counts[label] = 0
            label_counts[label] += 1
        sorted_label_counts = sorted(label_counts.items(), key=lambda x: x[1], reverse=True)
        return sorted_label_counts[0][0]

    def create_tree(self, data, labels):
        """Recursively build the decision tree as nested dictionaries."""
        class_list = [sample[-1] for sample in data]
        if class_list.count(class_list[0]) == len(class_list):
            return class_list[0]
        if len(data[0]) == 1:
            return self.majority_count(class_list)
        best_feature_index, best_split_value = self.choose_best_feature(data)
        best_feature_label = labels[best_feature_index]
        tree = {best_feature_label: {}}
        del(labels[best_feature_index])
        feature_values = [sample[best_feature_index] for sample in data]
        unique_values = set(feature_values)
        for value in unique_values:
            sub_labels = labels[:]
            tree[best_feature_label][value] = self.create_tree(self.split_data(data, best_feature_index, value), sub_labels)
        return tree

    def fit(self, X_train, y_train):
        """Train the model: stack features and labels, then build the tree."""
        data_set = np.hstack((X_train, y_train.reshape(-1, 1)))
        labels_set=['feature_{}'.format(i) for i in range(X_train.shape[1])]
        labels_set.append('label')
        
        self.tree=self.create_tree(data_set.tolist(),labels_set)

    def predict(self, X_test):
        """Predict the class label for each sample in X_test."""
        y_pred = []
        for x_test in X_test:
            node = self.tree.copy()
            while isinstance(node, dict):
                feature = list(node.keys())[0]
                node = node[feature]
                feature_idx = int(feature.split('_')[-1])
                # follow the branch matching the sample's value, otherwise the other branch
                branch_values = list(node.keys())
                if x_test[feature_idx] == branch_values[0]:
                    node = node[branch_values[0]]
                else:
                    node = node[branch_values[1]]
            y_pred.append(node)
        return np.array(y_pred)
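As a quick sanity check of the tree builder, it can be run on the same kind of toy "is it a fish" data used in the ID3 example (hypothetical English-labelled rows here):

cart = CARTDecisionTree()

toy_labels = ['no surfacing', 'has flippers', 'has head']
toy_data = [
    ['yes', 'yes', 'yes', 'fish'],
    ['yes', 'yes', 'no',  'not fish'],
    ['yes', 'no',  'yes', 'not fish'],
    ['no',  'yes', 'no',  'not fish'],
    ['no',  'no',  'yes', 'not fish'],
]
# create_tree deletes entries from the label list, so pass a copy
print(cart.create_tree(toy_data, toy_labels[:]))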

4. Case studies

4.1 Classifying the iris dataset

The iris dataset contains 150 samples, each with four features (sepal length, sepal width, petal length, and petal width), and each sample belongs to one of three classes (Iris setosa, Iris versicolor, or Iris virginica).

First, directly use the scikit-learn implementation:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split
import graphviz

iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

clf = DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, y_train)

class_names = ['山鸢尾', '变色鸢尾', '维吉尼亚鸢尾']
feature_names = ['萼片长度', '萼片宽度', '花瓣长度', '花瓣宽度']
dot_data = export_graphviz(clf, out_file=None, feature_names=feature_names, class_names=class_names, filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data)
graph.render('iris_decision_tree')
graph

Source.png

  • entropy is the information entropy of the node
  • samples is the number of training samples that reach the node
  • value is the number of samples of each class at the node
  • class is the class predicted at the node
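These quantities can also be read programmatically from the fitted estimator's tree_ attribute, which is sometimes easier than inspecting the rendered graph. A small sketch for the clf trained above:

t = clf.tree_
for node_id in range(t.node_count):
    is_leaf = t.children_left[node_id] == -1   # sklearn marks missing children with -1
    # note: depending on the scikit-learn version, `value` holds class counts or normalized frequencies
    print(f"node {node_id}: entropy={t.impurity[node_id]:.3f}, "
          f"samples={t.n_node_samples[node_id]}, "
          f"value={t.value[node_id].ravel()}, leaf={is_leaf}")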

The tree above is somewhat complex; it can be simplified using parameter tuning and pruning:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split, GridSearchCV, cross_val_score
import graphviz

iris = load_iris()

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.3, random_state=42)

# define the parameter search ranges
param_grid = {'max_depth': range(1, 10), 'min_samples_leaf': range(1, 10)}

# use grid search to find the best parameters
grid_search = GridSearchCV(DecisionTreeClassifier(criterion='entropy'), param_grid, cv=5)
grid_search.fit(X_train, y_train)

# train the model with the best parameters
clf = DecisionTreeClassifier(criterion='entropy', **grid_search.best_params_)
clf.fit(X_train, y_train)

# evaluate subtrees of each depth with cross-validation
cv_scores = []
for i in range(1, clf.tree_.max_depth + 1):
    clf_pruned = DecisionTreeClassifier(criterion='entropy', max_depth=i)
    scores = cross_val_score(clf_pruned, X_train, y_train, cv=5)
    cv_scores.append((i, scores.mean()))

# select the best subtree (depth) for pruning
best_depth = max(cv_scores, key=lambda x: x[1])[0]
clf_pruned = DecisionTreeClassifier(criterion='entropy', max_depth=best_depth)
clf_pruned.fit(X_train, y_train)

class_names = ['山鸢尾', '变色鸢尾', '维吉尼亚鸢尾']
feature_names = ['萼片长度', '萼片宽度', '花瓣长度', '花瓣宽度']
dot_data = export_graphviz(clf_pruned, out_file=None, feature_names=feature_names, class_names=class_names, filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data)
graph.render('iris_decision_tree_pruned')
graph

iris_pruned.png
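Neither tree above is evaluated on the held-out split; a quick comparison of test accuracy for the unpruned and pruned models (using the X_test and y_test created earlier) might look like this:

from sklearn.metrics import accuracy_score

print('unpruned test accuracy:', accuracy_score(y_test, clf.predict(X_test)))
print('pruned test accuracy:  ', accuracy_score(y_test, clf_pruned.predict(X_test)))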

4.2 Predicting League of Legends match outcomes with a decision tree

Dataset source:

The main features are listed below (red-side columns are analogous to the blue-side ones):

  • gameId: game ID
  • blueWins: whether the blue side wins (the prediction target)
  • blueWardsPlaced: number of wards placed
  • blueWardsDestroyed: number of enemy wards destroyed
  • blueFirstBlood: whether the blue side got first blood
  • blueKills: number of kills
  • blueDeaths: number of deaths
  • blueAssists: number of assists
  • blueEliteMonsters: number of elite monsters taken (dragons and heralds)
  • blueDragons: number of dragons taken
  • blueHeralds: number of Rift Heralds taken
  • blueTowersDestroyed: number of towers destroyed
  • blueTotalGold: total gold
  • blueAvgLevel: average champion level
  • blueTotalExperience: total experience
  • blueTotalMinionsKilled: total minions killed (CS)
  • blueTotalJungleMinionsKilled: jungle monsters killed
  • blueGoldDiff: gold difference (blue minus red)
  • blueExperienceDiff: experience difference
  • blueCSPerMin: CS per minute
  • blueGoldPerMin: gold per minute

Code reference: https://www.kaggle.com/code/xiyuewang/lol-how-to-win#Introduction

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn import tree
from sklearn.model_selection import GridSearchCV
import graphviz

# %matplotlib inline
# sns.set_style('darkgrid')

df = pd.read_csv('high_diamond_ranked_10min.csv')

df_clean = df.copy()


# drop redundant columns
cols = ['gameId', 'redFirstBlood', 'redKills', 'redEliteMonsters', 'redDragons','redTotalMinionsKilled',
       'redTotalJungleMinionsKilled', 'redGoldDiff', 'redExperienceDiff', 'redCSPerMin', 'redGoldPerMin', 'redHeralds',
       'blueGoldDiff', 'blueExperienceDiff', 'blueCSPerMin', 'blueGoldPerMin', 'blueTotalMinionsKilled']
df_clean = df_clean.drop(cols, axis = 1)


# g = sns.PairGrid(data=df_clean, vars=['blueKills', 'blueAssists', 'blueWardsPlaced', 'blueTotalGold'], hue='blueWins', size=3, palette='Set1')
# g.map_diag(plt.hist)
# g.map_offdiag(plt.scatter)
# g.add_legend();

# plt.figure(figsize=(16, 12))
# sns.heatmap(df_clean.drop('blueWins', axis=1).corr(), cmap='YlGnBu', annot=True, fmt='.2f', vmin=0);

# further feature selection
cols = ['blueAvgLevel', 'redWardsPlaced', 'redWardsDestroyed', 'redDeaths', 'redAssists', 'redTowersDestroyed',
       'redTotalExperience', 'redTotalGold', 'redAvgLevel']
df_clean = df_clean.drop(cols, axis=1)

print(df_clean)

# compute each feature's Pearson correlation with blueWins; values lie in [-1, 1] and measure linear correlation
corr_list = df_clean[df_clean.columns[1:]].apply(lambda x: x.corr(df_clean['blueWins']))

cols = []
for col in corr_list.index:
    if (corr_list[col]>0.2 or corr_list[col]<-0.2):
        cols.append(col)


df_clean = df_clean[cols]
# df_clean.hist(alpha = 0.7, figsize=(12,10), bins=5);

from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz
X = df_clean
y = df['blueWins']

# scaler = MinMaxScaler()
# scaler.fit(X)
# X = scaler.transform(X)


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)


dtree = tree.DecisionTreeClassifier(max_depth=3)  # new name so the imported `tree` module is not shadowed

# search the best params
grid = {'min_samples_split': [5, 10, 20, 50, 100]}

clf_tree = GridSearchCV(dtree, grid, cv=5)
clf_tree.fit(X_train, y_train)

pred_tree = clf_tree.predict(X_test)

# get the accuracy score
from sklearn.metrics import accuracy_score
acc_tree = accuracy_score(y_test, pred_tree)
print(acc_tree)

# label 0 -> red side wins, label 1 -> blue side wins
class_names = ['红色方胜', '蓝色方胜']
feature_names = cols
dot_data = export_graphviz(clf_tree.best_estimator_, out_file=None, feature_names=feature_names, class_names=class_names, filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data)
graph.render('lol_decision_tree')
graph

lol.png

References

  1. Machine Learning in Action
  2. The difference between CART and ID3/C4.5 in feature selection for decision tree algorithms
  3. Python code: recursive implementation of C4.5 decision tree generation, pruning, and classification
  4. https://github.com/43254022km/C4.5-Algorithm
  5. https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html#sklearn.tree.DecisionTreeClassifier
