Data Mining Learning: Decision Tree Classification Algorithm Theory (with an Iris Hands-On Example)

Table of Contents

1. Overview of the decision tree classification algorithm and related formulas

(1) Basic idea

(2) Entropy formulas

(3) Gini index formula

2. ID3 algorithm

3. C4.5 algorithm

4. CART algorithm

5. Comparison of various decision tree classification algorithms

6. Overfitting and decision tree pruning

(1) Overfitting

(2) Decision tree pruning methods

1. Pre-pruning

2. Post-pruning

7. Decision tree hands-on practice (training on the Iris dataset)

(1) Iris dataset

(2) Getting started

8. Complete code for the Iris hands-on example


1. Overview of the decision tree classification algorithm and related formulas

(1) Basic idea

The decision tree classification algorithm approximates a discrete-valued target function and is a classic classification method. First, the existing labeled data is processed, rules are summarized from it, and a decision tree is generated; then each new input is run through the generated tree to decide which category it belongs to.

(2) Entropy formulas

Entropy formula:

H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k ,  where p_k = |C_k| / |D| is the proportion of samples belonging to class C_k.

Conditional entropy formula:

H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i) ,  where feature A partitions D into subsets D_1, ..., D_n.

Information gain formula:

g(D, A) = H(D) - H(D|A)

(3) Gini index formula

Gini(D) = 1 - \sum_{k=1}^{K} p_k^2

For a binary split of D by feature A into D_1 and D_2:

Gini(D, A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)
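As a quick numeric sanity check of these formulas, here is a minimal sketch in Python; the tiny weather-style label set is invented purely for illustration:

from collections import Counter
from math import log2

def entropy(labels):
    """H(D) = -sum(p_k * log2(p_k)) over the class proportions p_k."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini(D) = 1 - sum(p_k ** 2) over the class proportions p_k."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# Invented example: feature A partitions the dataset D into two subsets
subsets = {'sunny': ['yes', 'no', 'no'], 'rainy': ['yes', 'yes', 'no']}
all_labels = [y for subset in subsets.values() for y in subset]

h_d = entropy(all_labels)                          # H(D)
h_d_a = sum(len(s) / len(all_labels) * entropy(s)  # H(D|A)
            for s in subsets.values())
print('H(D)    =', round(h_d, 4))                  # 1.0 for a 50/50 label mix
print('H(D|A)  =', round(h_d_a, 4))
print('g(D,A)  =', round(h_d - h_d_a, 4))          # information gain
print('Gini(D) =', round(gini(all_labels), 4))     # 0.5 for a 50/50 label mix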

2. ID3 algorithm

Specific steps:

(1) Starting from the root node, compute the information gain of every candidate feature for that node, and choose the feature with the largest information gain as the node's splitting feature (a sketch of this step follows the list).

(2) Create a child node for each value of the chosen feature, then recursively apply the same procedure to each child node.

(3) Repeat the two steps above until no features remain to choose from (or all samples at a node belong to the same class).
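A minimal sketch of step (1), choosing the split feature by maximum information gain; the toy dataset and its feature names are invented for illustration:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, feature, target='label'):
    """g(D, A) = H(D) - H(D|A) for feature A over the dataset rows."""
    base = entropy([r[target] for r in rows])
    cond = 0.0
    for v in set(r[feature] for r in rows):
        subset = [r[target] for r in rows if r[feature] == v]
        cond += len(subset) / len(rows) * entropy(subset)
    return base - cond

# Invented toy dataset: pick the root split for ID3
rows = [
    {'outlook': 'sunny', 'windy': 'true',  'label': 'no'},
    {'outlook': 'sunny', 'windy': 'false', 'label': 'no'},
    {'outlook': 'rainy', 'windy': 'false', 'label': 'yes'},
    {'outlook': 'rainy', 'windy': 'true',  'label': 'no'},
    {'outlook': 'rainy', 'windy': 'true',  'label': 'yes'},
]
best = max(['outlook', 'windy'], key=lambda f: info_gain(rows, f))
print('root split feature:', best)  # 'outlook' wins on this toy data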

3. C4.5 algorithm

Using information gain as the criterion for splitting the training data is biased toward features with many distinct values; C4.5 corrects this with the information gain ratio (it is an optimization of the ID3 algorithm).

Specific steps:

(1) Starting from the root node, compute the information gain of every candidate feature for that node.

(2) Divide each feature's information gain by the entropy of that feature's own value distribution to obtain its information gain ratio (a sketch follows the list).

(3) Compare the information gain ratios and choose the feature with the largest ratio as the node's splitting feature.

(4) Create a child node for each value of the chosen feature.

(5) Repeat the steps above until feature selection is complete.
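The gain ratio divides a feature's information gain by the entropy of the feature's own value distribution (the "split information"), which penalizes many-valued features. A minimal, self-contained sketch on invented toy data:

from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, feature, target='label'):
    """g_R(D, A) = g(D, A) / H_A(D), where H_A(D) is the split information."""
    base = entropy([r[target] for r in rows])
    cond = 0.0
    for v in set(r[feature] for r in rows):
        subset = [r[target] for r in rows if r[feature] == v]
        cond += len(subset) / len(rows) * entropy(subset)
    split_info = entropy([r[feature] for r in rows])  # H_A(D)
    return (base - cond) / split_info if split_info > 0 else 0.0

rows = [
    {'outlook': 'sunny', 'label': 'no'},
    {'outlook': 'sunny', 'label': 'no'},
    {'outlook': 'rainy', 'label': 'yes'},
    {'outlook': 'rainy', 'label': 'no'},
    {'outlook': 'rainy', 'label': 'yes'},
]
print('gain ratio:', round(gain_ratio(rows, 'outlook'), 4))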

4. CART algorithm

The CART algorithm assumes the decision tree is binary: each internal node tests whether a feature takes a given value, and by convention the "yes" branch goes to the left and the "no" branch to the right.

Specific steps:

(1) In the sample space of the training data set, recursively split each feature's values into two regions.

(2) Compute the Gini index of each candidate split using the Gini formula (a sketch follows the list).

(3) Select the feature with the smallest Gini index as the optimal feature, and its corresponding split as the optimal split point.

(4) Distribute the training samples into the node's two children according to the split point.

(5) Repeat the steps above.

(6) Stop when the number of samples in a node falls below a threshold, the Gini index falls below a threshold, or no features remain.
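A minimal sketch of steps (2) and (3): score every candidate binary split by its weighted Gini index and keep the smallest (toy data invented for illustration):

from collections import Counter

def gini(labels):
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(rows, feature, value, target='label'):
    """Gini(D, A): weighted Gini index after the binary split
    'feature == value' (left / yes) vs. 'feature != value' (right / no)."""
    left = [r[target] for r in rows if r[feature] == value]
    right = [r[target] for r in rows if r[feature] != value]
    n = len(rows)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

rows = [
    {'outlook': 'sunny', 'windy': 'true',  'label': 'no'},
    {'outlook': 'sunny', 'windy': 'false', 'label': 'no'},
    {'outlook': 'rainy', 'windy': 'false', 'label': 'yes'},
    {'outlook': 'rainy', 'windy': 'true',  'label': 'no'},
    {'outlook': 'rainy', 'windy': 'true',  'label': 'yes'},
]
candidates = [(f, v) for f in ('outlook', 'windy')
              for v in sorted(set(r[f] for r in rows))]
best = min(candidates, key=lambda fv: gini_split(rows, *fv))
print('best binary split:', best)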

5. Comparison of various decision tree classification algorithms

Algorithm | Supported models           | Tree structure | Feature selection              | Continuous values | Missing values | Pruning
ID3       | Classification             | Multi-way tree | Information gain               | Not supported     | Not supported  | Not supported
C4.5      | Classification             | Multi-way tree | Information gain ratio         | Supported         | Supported      | Supported
CART      | Classification, regression | Binary tree    | Gini index, mean squared error | Supported         | Supported      | Supported

6. Overfitting and decision tree pruning

(1) Overfitting

A model that performs well on the training set but poorly on the test set is said to overfit. It is like a student who memorizes the exercises in the textbook by rote and then fails the exam as soon as the questions are phrased differently; roughly speaking, that is overfitting.

Decision trees overfit very easily, but pruning can reduce the problem.

(2) Decision tree pruning methods

1. Pre-pruning:

Limit the decision tree's free growth in advance by setting a maximum height or a threshold before the tree is fully grown.

2. Post-pruning:

Grow the full tree first, then prune it back. Common methods: REP (reduced-error pruning) and CCP (cost-complexity pruning). A sketch of both styles in scikit-learn follows.
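In scikit-learn, for example, pre-pruning corresponds to growth-limiting constructor parameters such as max_depth and min_samples_leaf, while CCP-style post-pruning is exposed through ccp_alpha. A minimal sketch on the Iris data; the specific parameter values here are arbitrary choices for illustration:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()

# Pre-pruning: limit the tree's growth up front
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
pre_pruned.fit(iris.data, iris.target)

# Post-pruning (CCP): grow the full tree, then cut it back with ccp_alpha;
# cost_complexity_pruning_path enumerates the candidate alpha values
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(
    iris.data, iris.target)
# The largest alpha prunes everything down to the root, so the second-largest
# is used here purely for illustration
post_pruned = DecisionTreeClassifier(random_state=0,
                                     ccp_alpha=path.ccp_alphas[-2])
post_pruned.fit(iris.data, iris.target)
print('pre-pruned depth :', pre_pruned.get_depth())
print('post-pruned depth:', post_pruned.get_depth())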

7. Decision tree hands-on practice (training on the Iris dataset)

(1) Iris dataset:

Also known as the iris flower dataset, it is a classic dataset for multivariate analysis. It contains 150 samples divided into 3 classes of 50 samples each, and every sample carries 4 attribute values. These four attributes can be used to predict which species a given iris belongs to.

(2) Getting started

To view the decision tree dot file that we will save at the end, install the GraphViz package from your Python environment's settings.

So let's get started!

1. Import the Iris dataset and load it into a DataFrame

Code:

import pandas as pd
# classification_report produces a text report of the main classification metrics
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.tree import export_graphviz

# Load the data
iris = load_iris()
irisdf = pd.DataFrame(iris.data, columns=iris.feature_names)
print(irisdf.head(5))

Running result:

   sepal length (cm)  sepal width (cm)  petal length (cm)  petal width (cm)
0                5.1               3.5                1.4               0.2
1                4.9               3.0                1.4               0.2
2                4.7               3.2                1.3               0.2
3                4.6               3.1                1.5               0.2
4                5.0               3.6                1.4               0.2

2. Train the model

Code:

dct = DecisionTreeClassifier()
dct.fit(iris.data, iris.target)
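A side note: DecisionTreeClassifier splits on the Gini index by default (CART-style); passing criterion='entropy' switches it to information-gain-based splits, and fixing random_state makes runs reproducible:

# Entropy-based splitting instead of the default 'gini'
dct = DecisionTreeClassifier(criterion='entropy', random_state=0)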

3. Display evaluation metrics such as precision, recall, and F1 score

Code:

print(classification_report(iris.target, dct.predict(iris.data)))

The result of the operation is as follows; since the model is evaluated on the very data it was trained on, precision, recall, and F1 all come out at 1.00.
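Because the report above is computed on the training data itself, it overstates real performance (see the overfitting section). A more honest variation holds out a test set with train_test_split; the 30% split ratio below is an arbitrary choice for illustration:

from sklearn.model_selection import train_test_split

# Hold out 30% of the samples so the report reflects unseen data
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.3, random_state=0)
clf = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))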

4. Use export_graphviz to save the decision tree as a dot file; after downloading GraphViz, open gvedit.exe to view the tree

Code:

export_graphviz(dct, out_file='tree1.dot', feature_names=iris.feature_names, class_names=iris.target_names)

Running result: a visualization of the decision tree structure.
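If you would rather not install GraphViz, newer scikit-learn versions (0.21+) can draw the fitted tree directly with matplotlib via plot_tree; a minimal alternative, continuing from the code above:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Render the fitted tree without any external GraphViz dependency
plt.figure(figsize=(12, 8))
plot_tree(dct, feature_names=iris.feature_names,
          class_names=list(iris.target_names), filled=True)
plt.show()

Alternatively, the saved dot file can be converted from the command line with: dot -Tpng tree1.dot -o tree1.png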

8. Complete code for the Iris hands-on example

(You can modify it and add visualizations of the results to suit your own needs.)

import pandas as pd
# classification_report produces a text report of the main classification metrics
from sklearn.metrics import classification_report
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
# export_graphviz saves the decision tree as a dot file; after downloading
# GraphViz, open gvedit.exe to view the tree
from sklearn.tree import export_graphviz


# Load the data
iris = load_iris()
irisdf = pd.DataFrame(iris.data, columns=iris.feature_names)

# Train the model
dct = DecisionTreeClassifier()
dct.fit(iris.data, iris.target)

# Report precision, recall, and F1 (computed here on the training data itself)
print(classification_report(iris.target, dct.predict(iris.data)))

# Save the tree structure as a dot file for viewing in GraphViz
export_graphviz(dct, out_file='tree1.dot', feature_names=iris.feature_names, class_names=iris.target_names)

Source: https://blog.csdn.net/weixin_52135595/article/details/126712445