Learning Directory:
Decision Tree - Contents:
1. The role of a decision tree:
A decision tree encodes the decision-making process for judging whether a melon is good or bad. It helps us:
1. Choose which feature to test first (and which next) in the chain of if-conditions, so that we can decide as quickly as possible whether the melon is good or bad.
2. Determine which value of a feature to use as the splitting threshold.
2. Principle Derivation
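The derivation itself is not written out in these notes. As a minimal numeric sketch of the two quantities the derivation rests on - entropy and information gain, which is what criterion="entropy" below optimizes - using the melon example (the entropy and info_gain helpers here are illustrative, not from any library):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(D) = -sum(p_k * log2(p_k)) over class proportions."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(labels, groups):
    """Information gain of splitting `labels` into the given subgroups."""
    n = len(labels)
    return entropy(labels) - sum(len(g) / n * entropy(g) for g in groups)

# 8 melons: 4 good, 4 bad -> maximum uncertainty
melons = ["good"] * 4 + ["bad"] * 4
print(entropy(melons))  # 1.0

# A feature that separates good from bad perfectly has the highest gain,
# so the tree tests it first; a useless feature has zero gain.
print(info_gain(melons, [["good"] * 4, ["bad"] * 4]))                  # 1.0
print(info_gain(melons, [["good", "bad"] * 2, ["good", "bad"] * 2]))   # 0.0
```

The tree-building algorithm repeats this comparison at every node: compute the gain of each candidate feature/threshold and split on the largest one.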
3. Code prediction:
Case comparison: compare the classification accuracy of the decision tree and KNN algorithms on the iris data set.
Use the decision tree algorithm to classify the iris data set:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def decision_iris():
    """Classify the iris data set with a decision tree."""
    # Load the data
    iris = load_iris()
    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(
        iris.data, iris.target, random_state=22)
    # Decision tree estimator
    estimator = DecisionTreeClassifier(criterion="entropy")
    estimator.fit(x_train, y_train)  # fit on the training data
    # Model evaluation
    # Method 1: compare the true and predicted values directly
    y_predict = estimator.predict(x_test)
    print('y_predict:\n', y_predict)
    print('Direct comparison of true and predicted values:\n', y_test == y_predict)
    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print('Accuracy:\n', score)

if __name__ == '__main__':
    decision_iris()
Use the KNN algorithm to classify the iris data set:
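The KNN code itself is not reproduced in these notes; a sketch of the counterpart, assuming the same train/test split (random_state=22) and the usual KNN workflow of standardizing features before fitting (n_neighbors=3 is an illustrative choice, not from the original):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

def knn_iris():
    """Classify the iris data set with KNN for comparison."""
    iris = load_iris()
    x_train, x_test, y_train, y_test = train_test_split(
        iris.data, iris.target, random_state=22)
    # KNN is distance-based, so standardize the features first
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    estimator = KNeighborsClassifier(n_neighbors=3)
    estimator.fit(x_train, y_train)
    score = estimator.score(x_test, y_test)
    print('KNN accuracy:\n', score)
    return score

if __name__ == '__main__':
    knn_iris()
```

Running both scripts on the same split makes the accuracy comparison in the conclusion below directly reproducible.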
Conclusion: the KNN algorithm is more accurate here, because KNN is inherently suited to small data sets (the iris data set has only 150 samples) and it computes the distance to every sample one by one. The decision tree is better suited to large data sets. -------- Different algorithms suit different scenarios.
4. Decision tree visualization
The generated dot file:
digraph Tree {
node [shape=box] ;
0 [label="petal width (cm) <= 0.75\nentropy = 1.584\nsamples = 112\nvalue = [39, 37, 36]"] ;
1 [label="entropy = 0.0\nsamples = 39\nvalue = [39, 0, 0]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="petal width (cm) <= 1.75\nentropy = 1.0\nsamples = 73\nvalue = [0, 37, 36]"] ;
0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
3 [label="petal length (cm) <= 5.05\nentropy = 0.391\nsamples = 39\nvalue = [0, 36, 3]"] ;
2 -> 3 ;
4 [label="sepal length (cm) <= 4.95\nentropy = 0.183\nsamples = 36\nvalue = [0, 35, 1]"] ;
3 -> 4 ;
5 [label="petal length (cm) <= 3.9\nentropy = 1.0\nsamples = 2\nvalue = [0, 1, 1]"] ;
4 -> 5 ;
6 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
5 -> 6 ;
7 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 0, 1]"] ;
5 -> 7 ;
8 [label="entropy = 0.0\nsamples = 34\nvalue = [0, 34, 0]"] ;
4 -> 8 ;
9 [label="petal width (cm) <= 1.55\nentropy = 0.918\nsamples = 3\nvalue = [0, 1, 2]"] ;
3 -> 9 ;
10 [label="entropy = 0.0\nsamples = 2\nvalue = [0, 0, 2]"] ;
9 -> 10 ;
11 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
9 -> 11 ;
12 [label="petal length (cm) <= 4.85\nentropy = 0.191\nsamples = 34\nvalue = [0, 1, 33]"] ;
2 -> 12 ;
13 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
12 -> 13 ;
14 [label="entropy = 0.0\nsamples = 33\nvalue = [0, 0, 33]"] ;
12 -> 14 ;
}
Copy the content of the dot file into this website:
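The dot file above can be generated with export_graphviz; a sketch assuming the tree fitted in section 3 (node numbers and thresholds will only match for the same random_state). Passing out_file=None makes export_graphviz return the dot source as a string instead of writing a file:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=22)
estimator = DecisionTreeClassifier(criterion="entropy")
estimator.fit(x_train, y_train)

# out_file=None returns the dot source as a string
dot_data = export_graphviz(
    estimator,
    out_file=None,
    feature_names=iris.feature_names,
)
print(dot_data[:14])  # 'digraph Tree {'
```

The returned string (or the saved .dot file) can then be pasted into an online Graphviz renderer or compiled locally with the dot command.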
5. Summary of the decision tree algorithm:
Advantages:
Easy to visualize - strong interpretability
Disadvantages:
When the amount of data is very large, the tree becomes too complex, which easily leads to good performance on the training samples but poor performance on the test samples (overfitting).
How to improve?
Pruning - the CART algorithm (already implemented in the decision tree API)
Random forest algorithm (covered in the next section)
6. Case: Titanic passenger survival prediction
Extract class (pclass), age, and sex as features; the other columns do not affect survival:
Process the missing values in the age column by filling in the mean:
Convert to dictionary format:
Split the data set, extract dictionary features, build the decision tree estimator, and evaluate:
Results:
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

# Load the data
path = "http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt"
titanic = pd.read_csv(path)
# Select the feature values and the target value
x = titanic[['pclass', 'age', 'sex']].copy()
y = titanic['survived']
# Handle missing values: fill age with the column mean
x['age'] = x['age'].fillna(x['age'].mean())
# Convert to a list of dictionaries
x = x.to_dict(orient='records')
# Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y)
# Dictionary feature extraction
transfer = DictVectorizer()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)
# Build the model
estimator = DecisionTreeClassifier(criterion='entropy')
estimator.fit(x_train, y_train)
y_predict = estimator.predict(x_test)
print('Comparison of predicted and true values:\n', y_predict == y_test)
# Compute the accuracy
score = estimator.score(x_test, y_test)
print('Accuracy:\n', score)
# Visualize the decision tree (writes the dot file to disk)
export_graphviz(
    estimator,
    out_file="C:/Users/Admin/Desktop/iris_tree.dot",
    feature_names=transfer.get_feature_names_out(),
)
Summary of classification algorithms: