Basics of Machine Learning: "Classification Algorithm (6) - Decision Tree"

1. Decision tree

1. Understanding decision trees.
The idea behind decision trees is very simple: it comes from the conditional branch structure in programming, the if-else structure. The earliest decision trees were classification learning methods that used this kind of structure to split data.

2. An example conversation

Think about why the girl puts age first in her decision.
How can decisions be made efficiently? It depends on the order in which the features are examined.

2. Detailed explanation of decision tree classification principles

1. Start with an example problem

Four features are known, and we want to predict whether to lend to a given person:
(1) look at the house first, then the job --> decide whether to lend (only two features are examined)
(2) look at age, credit situation, then job --> decide whether to lend (three features are examined)
The second method is less efficient than the first.
We would like a mathematical method that quickly and automatically determines which feature to look at first.

2. Basics of information theory
This requires some information theory: information entropy and information gain.

(1) Information
Shannon's definition: information is something that eliminates random uncertainty.
Xiao Ming: "I am 18 years old this year."
Xiao Hua: "Xiao Ming will be 19 years old next year."

Once Xiao Ming has spoken, Xiao Hua's statement removes no further uncertainty, so it is redundant rather than information.

(2) Measuring the amount of information: information entropy

3. The definition of information entropy.
H is called information entropy, and its unit is the bit.
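
For reference, the standard definition of information entropy for a discrete random variable can be written as below. The symbols X, x_i and p(x_i) are notation assumed here, and the base-2 logarithm is chosen because it is what makes the unit a bit.

```latex
H(X) = -\sum_{i=1}^{n} p(x_i)\,\log_2 p(x_i)
```

Here p(x_i) is the probability of the i-th outcome; the more evenly the probabilities are spread, the larger H(X) is.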

4. Taking bank loan data as an example, calculate information entropy.
If someone's age, job, house, and credit situation are known, should we lend to this person?
We need to measure how large the uncertainty is.
There are two outcomes here: loan and no loan.
The probability of no loan is 6/15 and the probability of loan is 9/15.
H(total) = -(6/15 * log2(6/15) + 9/15 * log2(9/15)) = 0.971
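
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch: the helper name entropy is hypothetical, and it assumes base-2 logarithms, consistent with the unit being the bit.

```python
from math import log2


def entropy(counts):
    """Information entropy (in bits) of a list of class counts."""
    total = sum(counts)
    # -p * log2(p) is written as p * log2(1 / p); empty classes contribute 0.
    return sum(c / total * log2(total / c) for c in counts if c > 0)


# 9 of the 15 applicants got a loan and 6 did not.
h_total = entropy([9, 6])
print(round(h_total, 3))  # 0.971
```

A set containing only one class has entropy 0, which is why a pure node needs no further splitting.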

Once we know a certain feature, the uncertainty decreases.
So if we can work out how much the uncertainty decreases after each feature becomes known, we can compare the features and look first at the one that reduces the uncertainty the most.

What is the information entropy after a certain feature is known?
This leads to information gain.

5. Information gain:
Information gain is one of the criteria for splitting a decision tree.

(1) Definition and formula
The information gain g(D,A) of feature A on training data set D is defined as the difference between the information entropy H(D) of set D and the conditional entropy H(D|A) of D given feature A:
g(D,A) = H(D) - H(D|A)
Information gain measures how much the uncertainty is reduced once a certain feature is known.

Calculate the information gain after knowing the age:
g(D,age) = H(D) - H(D|age)

Find H(D|age) (log is base 2):
H(youth) = -(2/5 * log 2/5 + 3/5 * log 3/5) = 0.971
H(middle-aged) = -(2/5 * log 2/5 + 3/5 * log 3/5) = 0.971
H(elderly) = -(1/5 * log 1/5 + 4/5 * log 4/5) = 0.722
H(D|age) = 1/3 * H(youth) + 1/3 * H(middle-aged) + 1/3 * H(elderly) = 0.888
g(D,age) = 0.971 - 0.888 = 0.083

We use A1, A2, A3, and A4 to represent age, job, own house, and credit situation. The final results are g(D, A1) = 0.083, g(D, A2) = 0.324, g(D, A3) = 0.420, g(D, A4) = 0.363. A3 (own house) has the largest information gain, so we choose it as the first feature to split on.
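
The age calculation can be reproduced as a short sketch. The helper names and the dictionary layout are hypothetical; the per-group counts are read off the fractions used for H(D|age), and which of the two 2/3 groups is youth or middle-aged is an assumption that does not affect the result.

```python
from math import log2


def entropy(counts):
    """Information entropy (in bits) of a list of class counts."""
    total = sum(counts)
    return sum(c / total * log2(total / c) for c in counts if c > 0)


def conditional_entropy(groups):
    """Entropy of each group, weighted by the group's share of all samples."""
    n = sum(sum(counts) for counts in groups.values())
    return sum(sum(counts) / n * entropy(counts) for counts in groups.values())


# [no loan, loan] counts per age group (assumed ordering, 5 people each).
age_groups = {"youth": [3, 2], "middle-aged": [2, 3], "elderly": [1, 4]}

# Column sums recover the overall class counts: 6 no loan, 9 loan.
total_counts = [sum(col) for col in zip(*age_groups.values())]

h_cond = conditional_entropy(age_groups)
gain = entropy(total_counts) - h_cond
print(round(h_cond, 3), round(gain, 3))  # 0.888 0.083
```

Repeating the same call with the counts for job, own house, and credit situation would give the other three gains, but those counts are not listed in this article.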

(2) Formula
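
The quantities used in the calculation above can be written in the usual textbook form. The notation is an assumption here: C_k is the set of samples in D that belong to class k, and D_i is the subset of D in which feature A takes its i-th value.

```latex
H(D) = -\sum_{k=1}^{K} \frac{|C_k|}{|D|}\,\log_2 \frac{|C_k|}{|D|}

H(D \mid A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|}\,H(D_i)

g(D, A) = H(D) - H(D \mid A)
```

In the loan example, |D| = 15, the classes are loan and no loan, and for A = age each D_i is one of the three age groups with |D_i| / |D| = 1/3.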

(3) Information gain is not the only splitting criterion for decision trees; there are other methods:
ID3:
  information gain; choose the feature with the largest value
C4.5:
  information gain ratio; choose the largest
CART:
  classification tree: Gini coefficient (Gini impurity); choose the smallest. This is the default criterion in sklearn.
  advantage: the splits are finer-grained (see the example below)
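
As a rough illustration of the Gini criterion, the sketch below computes the Gini impurity of the loan labels from the earlier example. The function name gini is hypothetical, and the formula 1 - sum(p_k^2) is the standard definition rather than something stated in this article.

```python
def gini(counts):
    """Gini impurity of a list of class counts: 1 - sum(p_k ** 2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# 9 loans and 6 refusals, as in the entropy example.
print(round(gini([9, 6]), 3))  # 0.48
# A pure set has no impurity.
print(gini([15, 0]))           # 0.0
```

Like entropy, the value is 0 for a pure set and largest when the classes are evenly mixed, which is why the smallest value is preferred when choosing a split.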

3. Decision Tree API

1. API
class sklearn.tree.DecisionTreeClassifier(criterion='gini', max_depth=None, random_state=None)
Decision tree classifier
criterion: the splitting criterion. The default is 'gini' (Gini coefficient); 'entropy' selects information gain.
max_depth: the maximum depth of the tree. A tree that is too deep leads to overfitting.
  Overfitting means poor generalization: the model fits the current sample set too closely and lacks the ability to adapt to (predict) new samples.
random_state: random number seed
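
To see what max_depth does in practice, here is a minimal sketch that fits one unrestricted tree and one tree capped at depth 2 on the iris data. The depth limit of 2 and random_state=0 are arbitrary values chosen for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=6)

# An unrestricted tree versus a tree that may grow at most 2 levels deep.
full_tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
shallow_tree = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)

for name, tree in [("full", full_tree), ("max_depth=2", shallow_tree)]:
    tree.fit(x_train, y_train)
    print(name,
          "| depth:", tree.get_depth(),
          "| train accuracy:", tree.score(x_train, y_train),
          "| test accuracy:", tree.score(x_test, y_test))
```

A large gap between training and test accuracy for the unrestricted tree would be a sign of overfitting; on a data set as small and clean as iris the gap may be slight.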

2. Decision tree classification of iris flowers

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier

def KNN_iris():
    """
    Classify the iris data with the KNN algorithm
    """
    # 1. Load the data
    iris = load_iris()
    print("iris.data:\n", iris.data)
    print("iris.target:\n", iris.target)
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=6)
    # 3. Feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    # Standardize the test set with the mean and standard deviation of the training set.
    # fit() computes the mean and standard deviation; that was already done on the training set,
    # so the test set only needs transform() and reuses the training statistics.
    x_test = transfer.transform(x_test)
    # 4. KNN estimator
    estimator = KNeighborsClassifier(n_neighbors=3)
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # Method 1: compare the true values with the predicted values directly
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of true and predicted values:\n", y_test == y_predict)
    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    return None

def KNN_iris_gscv():
    """
    Classify the iris data with the KNN algorithm, adding grid search and cross-validation
    """
    # 1. Load the data
    iris = load_iris()
    print("iris.data:\n", iris.data)
    print("iris.target:\n", iris.target)
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=6)
    # 3. Feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    # Standardize the test set with the mean and standard deviation of the training set
    # (see the comment in KNN_iris)
    x_test = transfer.transform(x_test)
    # 4. KNN estimator
    estimator = KNeighborsClassifier()
    # Add grid search and cross-validation
    # Prepare the parameter grid
    param_dict = {"n_neighbors": [1, 3, 5, 7, 9, 11]}
    estimator = GridSearchCV(estimator, param_grid=param_dict, cv=10)
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # Method 1: compare the true values with the predicted values directly
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of true and predicted values:\n", y_test == y_predict)
    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    # Best parameters: best_params_
    print("Best parameters:\n", estimator.best_params_)
    # Best score: best_score_
    print("Best score:\n", estimator.best_score_)
    # Best estimator: best_estimator_
    print("Best estimator:\n", estimator.best_estimator_)
    # Cross-validation results: cv_results_
    print("Cross-validation results:\n", estimator.cv_results_)
    return None

def nb_news():
    """
    Classify news posts with the naive Bayes algorithm
    """
    # 1. Load the data
    news = fetch_20newsgroups(subset="all")
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(news.data, news.target)
    # 3. Feature engineering: text feature extraction with tf-idf
    transfer = TfidfVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 4. Naive Bayes estimator
    estimator = MultinomialNB()
    estimator.fit(x_train, y_train)
    # 5. Model evaluation
    # Method 1: compare the true values with the predicted values directly
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of true and predicted values:\n", y_test == y_predict)
    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    return None

def decision_iris():
    """
    Classify the iris data with a decision tree
    """
    # 1. Load the data set
    iris = load_iris()
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=6)
    # 3. Decision tree estimator
    estimator = DecisionTreeClassifier(criterion='entropy')
    estimator.fit(x_train, y_train)
    # 4. Model evaluation
    # Method 1: compare the true values with the predicted values directly
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of true and predicted values:\n", y_test == y_predict)
    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    return None

if __name__ == "__main__":
    # Code 1: classify the iris data with the KNN algorithm
    KNN_iris()
    # Code 2: classify the iris data with KNN, adding grid search and cross-validation
    #KNN_iris_gscv()
    # Code 3: classify news posts with the naive Bayes algorithm
    #nb_news()
    # Code 4: classify the iris data with a decision tree
    decision_iris()

Output (the iris.data and iris.target printouts of KNN_iris() are omitted):

y_predict:
 [0 2 0 0 2 1 1 0 2 1 2 1 2 2 1 1 2 1 1 0 0 2 0 0 1 1 1 2 0 1 0 1 0 0 1 2 1
 2]
Direct comparison of true and predicted values:
 [ True  True  True  True  True  True False  True  True  True  True  True
  True  True  True False  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True False  True
  True  True]
Accuracy:
 0.9210526315789473
y_predict:
 [0 2 0 0 2 1 1 0 2 1 2 1 2 2 1 1 2 1 1 0 0 2 0 0 1 1 1 2 0 1 0 1 0 0 1 2 1
 2]
Direct comparison of true and predicted values:
 [ True  True  True  True  True  True False  True  True  True  True  True
  True  True  True False  True  True  True  True  True  True  True  True
  True  True  True  True  True  True  True  True  True  True False  True
  True  True]
Accuracy:
 0.9210526315789473

KNN is a lazy algorithm: at prediction time it has to compute the distance to every training sample held in memory.
A decision tree is better suited to situations where the amount of data is relatively large.

4. Visualization of decision trees

1. Save the tree structure to a dot file
sklearn.tree.export_graphviz()
This function exports the tree in DOT format.

2. tree.export_graphviz(estimator, out_file='tree.dot', feature_names=['',''])
estimator: the fitted estimator object
out_file: name of the exported file
feature_names: names of the features

3. Modify the code

Only the import of export_graphviz and the decision_iris() function change; KNN_iris(), KNN_iris_gscv() and nb_news() are the same as in the previous listing and are not repeated here.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

def decision_iris():
    """
    Classify the iris data with a decision tree
    """
    # 1. Load the data set
    iris = load_iris()
    # 2. Split the data set
    x_train, x_test, y_train, y_test = train_test_split(iris.data, iris.target, random_state=6)
    # 3. Decision tree estimator
    estimator = DecisionTreeClassifier(criterion='entropy')
    estimator.fit(x_train, y_train)
    # 4. Model evaluation
    # Method 1: compare the true values with the predicted values directly
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Direct comparison of true and predicted values:\n", y_test == y_predict)
    # Method 2: compute the accuracy
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    # Visualize the decision tree: write it to a DOT file
    export_graphviz(estimator, out_file='iris_tree.dot', feature_names=iris.feature_names)
    return None

if __name__ == "__main__":
    # Code 4: classify the iris data with a decision tree
    decision_iris()

The iris_tree.dot file is generated after running:

digraph Tree {
node [shape=box] ;
0 [label="petal width (cm) <= 0.8\nentropy = 1.584\nsamples = 112\nvalue = [38, 38, 36]"] ;
1 [label="entropy = 0.0\nsamples = 38\nvalue = [38, 0, 0]"] ;
0 -> 1 [labeldistance=2.5, labelangle=45, headlabel="True"] ;
2 [label="petal width (cm) <= 1.65\nentropy = 0.999\nsamples = 74\nvalue = [0, 38, 36]"] ;
0 -> 2 [labeldistance=2.5, labelangle=-45, headlabel="False"] ;
3 [label="sepal length (cm) <= 7.1\nentropy = 0.179\nsamples = 37\nvalue = [0, 36, 1]"] ;
2 -> 3 ;
4 [label="entropy = 0.0\nsamples = 36\nvalue = [0, 36, 0]"] ;
3 -> 4 ;
5 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 0, 1]"] ;
3 -> 5 ;
6 [label="petal length (cm) <= 5.05\nentropy = 0.303\nsamples = 37\nvalue = [0, 2, 35]"] ;
2 -> 6 ;
7 [label="sepal width (cm) <= 2.9\nentropy = 0.863\nsamples = 7\nvalue = [0, 2, 5]"] ;
6 -> 7 ;
8 [label="entropy = 0.0\nsamples = 4\nvalue = [0, 0, 4]"] ;
7 -> 8 ;
9 [label="petal width (cm) <= 1.75\nentropy = 0.918\nsamples = 3\nvalue = [0, 2, 1]"] ;
7 -> 9 ;
10 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
9 -> 10 ;
11 [label="sepal width (cm) <= 3.1\nentropy = 1.0\nsamples = 2\nvalue = [0, 1, 1]"] ;
9 -> 11 ;
12 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 0, 1]"] ;
11 -> 12 ;
13 [label="entropy = 0.0\nsamples = 1\nvalue = [0, 1, 0]"] ;
11 -> 13 ;
14 [label="entropy = 0.0\nsamples = 30\nvalue = [0, 0, 30]"] ;
6 -> 14 ;
}

4. Generate the image
The contents of the dot file can be pasted into http://webgraphviz.com to render the tree, but for me the site always displayed "loading...".

Later, I found another online generator.
Visit http://www.tasksteper.com:8099/flow/home/; log in with username/password: testuser1/testuser1; enter the "Integrated Tools" project; click "Create Entry"
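
If no online renderer is reachable, one option is to print the tree as plain text instead. The sketch below assumes a scikit-learn version that provides sklearn.tree.export_text; random_state=0 is an arbitrary choice added so that repeated runs give the same tree.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=6)

estimator = DecisionTreeClassifier(criterion="entropy", random_state=0)
estimator.fit(x_train, y_train)

# Render the fitted tree as indented text rules, one line per node.
tree_text = export_text(estimator, feature_names=list(iris.feature_names))
print(tree_text)
```

If Graphviz is installed locally, running dot -Tpng iris_tree.dot -o iris_tree.png on the exported file should also produce an image without any website.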

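5. Judgment process
(1) First look at the petal width: if it is less than or equal to 0.8, the sample follows the True branch to a leaf that contains only class 0 (value = [38, 0, 0])
(2) If the condition is not satisfied, the sample follows the False branch and the splitting continues
(3) entropy: the information entropy of the samples at that node; information gain is calculated from it
(4) samples: the number of samples at that node
(5) value: the number of samples of each class at that node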
5. Advantages and Disadvantages of Decision Trees

1. Advantages:
Simple and easy to understand, and the tree can be visualized.
Visualization gives strong interpretability.

2. Disadvantages:
The tree can become too complex and is then prone to overfitting.

3. Improvements:
pruning with the CART algorithm (already implemented in the decision tree API)
random forest
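
As one way to try pruning in scikit-learn, the sketch below uses the ccp_alpha parameter of DecisionTreeClassifier (minimal cost-complexity pruning). It assumes a scikit-learn version that has this parameter, and the value 0.02 is an arbitrary illustrative choice, not a tuned one.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
x_train, x_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=6)

# Same settings for both trees; only the pruning strength differs.
unpruned = DecisionTreeClassifier(random_state=0).fit(x_train, y_train)
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.02).fit(x_train, y_train)

print("nodes:", unpruned.tree_.node_count, "->", pruned.tree_.node_count)
print("test accuracy:", unpruned.score(x_test, y_test), "->", pruned.score(x_test, y_test))
```

A larger ccp_alpha removes more branches; a suitable value would normally be chosen with cross-validation, for example with GridSearchCV as in the KNN listing.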
 

Origin blog.csdn.net/csj50/article/details/132732152