Introduction to Machine Learning (7): Classification Algorithms - the Decision Tree Algorithm

Learning directory: (figure omitted)
Decision tree content directory: (figure omitted)

1. The role of the decision tree:

(Figure: a flowchart of feature tests for judging whether a melon is good or bad; screenshot omitted.)
   The figure above shows our decision process for judging whether a melon is good or bad. The decision tree plays two roles:
1. It chooses which feature to test first and which to test next, so that we can decide "good melon" or "bad melon" as quickly as possible.
2. It determines the threshold value of each feature to use as the splitting criterion.
A minimal sketch of this idea follows below.
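A hypothetical sketch in Python: the features, values, and test order are invented for illustration, but a trained decision tree is essentially this kind of learned nested if structure.

def judge_melon(texture: str, root: str) -> str:
    """Toy decision process; 'texture' and 'root' and their order are illustrative only."""
    if texture == "clear":      # the first split uses the most informative feature
        if root == "curled":    # the second split refines the decision
            return "good melon"
        return "bad melon"
    return "bad melon"

print(judge_melon("clear", "curled"))  # -> good melon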

2. Principle Derivation

(Figures: the principle derivation; screenshots omitted.)
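The derivation screenshots are not reproduced here. Since the code below builds the tree with criterion="entropy", they presumably cover information entropy and information gain (the ID3 splitting criterion). For reference, for a data set $D$ with $K$ classes and a feature $a$ taking $V$ values:

$$H(D) = -\sum_{k=1}^{K} p_k \log_2 p_k$$

$$\mathrm{Gain}(D, a) = H(D) - \sum_{v=1}^{V} \frac{|D^v|}{|D|} \, H(D^v)$$

Here $p_k$ is the proportion of class $k$ in $D$, and $D^v$ is the subset of $D$ whose feature $a$ takes value $v$; at each node the tree splits on the feature with the largest information gain.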

3. Prediction in code:


Case comparison: compare the classification accuracy of the decision tree algorithm and the KNN algorithm on the iris data set.

Use the decision tree algorithm to classify the iris data set:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def decision_iris():
    """Classify the iris data set with a decision tree."""
    # Load the data
    iris = load_iris()
    # Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)
    # Decision tree estimator
    estimator = DecisionTreeClassifier(criterion="entropy")
    estimator.fit(x_train, y_train)  # fit on the training data
    # Model evaluation
    # Method 1: compare the true values and the predicted values directly
    y_predict = estimator.predict(x_test)
    print('y_predict:\n', y_predict)
    print('Direct comparison of true and predicted values:\n', y_test == y_predict)
    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print('Accuracy:\n', score)

if __name__ == '__main__':
    decision_iris()

Use the KNN algorithm to classify the iris data set (the original code is a screenshot and is not reproduced here):
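As a reference, here is a minimal sketch of what that screenshot likely contains, assuming the same random_state=22 split; the standardization step and n_neighbors=3 are assumptions, not taken from the original.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

def knn_iris():
    """Classify the iris data set with KNN (sketch; details assumed)."""
    iris = load_iris()
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=22)
    # KNN is distance-based, so standardize the features first (assumed step)
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    estimator = KNeighborsClassifier(n_neighbors=3)  # n_neighbors=3 is an assumption
    estimator.fit(x_train, y_train)
    print('Accuracy:\n', estimator.score(x_test, y_test))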
Conclusion: the KNN algorithm is more accurate here. KNN is naturally suited to small data sets (the iris data set has only 150 samples), since it computes the distance to every sample one by one, whereas the decision tree is better suited to large data sets. Different algorithms suit different scenarios.

4. Decision tree visualization

(Figures: the code that exports the tree to a dot file, and its output; screenshots omitted.)
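A minimal sketch of what the export code presumably looks like, placed inside decision_iris() above after estimator.fit(...); the output file name is an assumption.

from sklearn.tree import export_graphviz

# Dump the fitted tree to a Graphviz dot file; the feature names come from the data set itself
export_graphviz(estimator, out_file="iris_tree.dot", feature_names=iris.feature_names)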
The generated dot file:

digraph Tree {
node [shape=box] ;
0 [label="petal width (cm) <= 0.75\nentropy = 1.584\nsamples = 112\nvalue = [39, 37, 36]"] ;
1 [label="entropy = 0.0\nsamples = 39\nvalue = [39, 0, 0]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="petal width (cm) <= 1.75\nentropy = 1.0\nsamples = 73\nvalue = [0, 37, 36]"] ;
0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
3 [label="petal length (cm) <= 5.05\nentropy = 0.391\nsamples = 39\nvalue = [0, 36, 3]"] ;
2 -> 3 ;
4 [label="sepal length (cm) <= 4.95\nentropy = 0.183\nsamples = 36\nvalue = [0, 35, 1]"] ;
3 -> 4 ;
5 [label="petal length (cm) <= 3.9\nentropy = 1.0\nsamples = 2\nvalue = [0, 1, 1]"] ;
4 -> 5 ;
6 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
5 -> 6 ;
7 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 0, 1]"] ;
5 -> 7 ;
8 [label="entropy = 0.0\nsamples = 34\nvalue = [0, 34, 0]"] ;
4 -> 8 ;
9 [label="petal width (cm) <= 1.55\nentropy = 0.918\nsamples = 3\nvalue = [0, 1, 2]"] ;
3 -> 9 ;
10 [label="entropy = 0.0\nsamples = 2\nvalue = [0, 0, 2]"] ;
9 -> 10 ;
11 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
9 -> 11 ;
12 [label="petal length (cm) <= 4.85\nentropy = 0.191\nsamples = 34\nvalue = [0, 1, 33]"] ;
2 -> 12 ;
13 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
12 -> 13 ;
14 [label="entropy = 0.0\nsamples = 33\nvalue = [0, 0, 33]"] ;
12 -> 14 ;
}

Paste the contents of the dot file into the online Graphviz viewer shown in the screenshot to render the tree. (Screenshots of the dot file and the rendered tree are omitted.)
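Alternatively, if the graphviz Python package is installed, the dot file can be rendered locally instead of through a website; a sketch:

import graphviz

# Read the dot file and render it to a PNG locally
with open("iris_tree.dot") as f:
    graph = graphviz.Source(f.read())
graph.render("iris_tree", format="png")  # writes iris_tree.png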

5. Summary of the decision tree algorithm:

Advantages:
          easy to visualize, so the model is highly interpretable
Disadvantages:
          when the data set is very large, the tree grows overly complex, which easily yields a model that fits the training samples well but the test samples poorly (overfitting)
How to improve?
          pruning, as in the CART algorithm (already implemented in the decision tree API)
          the random forest algorithm (covered in the next section)
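In scikit-learn's CART-based decision tree, these pruning-style improvements appear as hyperparameters; a minimal sketch with illustrative, untuned values:

from sklearn.tree import DecisionTreeClassifier

# Illustrative pruning-style controls; the values are arbitrary examples, not tuned
estimator = DecisionTreeClassifier(
    criterion="entropy",
    max_depth=5,          # cap the depth of the tree
    min_samples_leaf=4,   # require at least 4 samples in every leaf
    ccp_alpha=0.01,       # cost-complexity post-pruning (scikit-learn >= 0.22)
)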

6. Case: Survival prediction of Titanic passengers

(Figure: a preview of the Titanic data set; screenshot omitted.)
The steps, shown as code screenshots in the original, are:
1. Extract cabin class (pclass), age, and sex as features; the other columns do not affect survival.
2. Handle the missing values in the age column by filling them with the mean.
3. Convert the features to dictionary format.
4. Split the data set, perform dictionary feature extraction, build the decision tree estimator, and evaluate. (Output screenshot omitted.)

The complete code:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

# Load the data
path = "http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt"
titanic = pd.read_csv(path)

# Select the feature values and the target value
x = titanic[['pclass', 'age', 'sex']].copy()
y = titanic['survived']

# Handle missing values: fill age with the mean
x['age'] = x['age'].fillna(x['age'].mean())
# Convert the features to a list of dictionaries
x = x.to_dict(orient='records')
# Split the data
x_train, x_test, y_train, y_test = train_test_split(x, y)

# Dictionary feature extraction (one-hot encodes pclass and sex)
transfer = DictVectorizer()
x_train = transfer.fit_transform(x_train)
x_test = transfer.transform(x_test)

# Build and fit the model
estimator = DecisionTreeClassifier(criterion='entropy')
estimator.fit(x_train, y_train)
y_predict = estimator.predict(x_test)
print('Comparison of predicted and true values:\n', y_predict == y_test)
# Compute the accuracy
score = estimator.score(x_test, y_test)
print('Accuracy:\n', score)
# Visualize the decision tree (export_graphviz writes the dot file and returns None)
# Note: get_feature_names was renamed to get_feature_names_out in newer scikit-learn
export_graphviz(
    estimator,
    out_file="C:/Users/Admin/Desktop/iris_tree.dot",
    feature_names=transfer.get_feature_names(),
)

Summary of classification algorithms: (figure omitted)

Origin: blog.csdn.net/qq_45234219/article/details/114998217