Machine Learning Algorithms - Decision Tree Summary

First, mathematical principles of the decision tree algorithm

A decision tree is a basic method for classification and regression. It is composed of nodes and directed edges; there are two kinds of nodes, internal nodes and leaf nodes. An internal node represents a feature (attribute), and a leaf node represents a class. The resulting tree structure can be viewed as a partition of the feature space together with the class-conditional probability distribution defined on it, or as a set of if-then rules: every instance is covered by exactly one path, i.e. by one and only one rule.

(1) Algorithm steps:

Three steps: feature selection → decision tree generation → decision tree pruning

(2) Two stages

Learning: build a decision tree model from the training data, based on the principle of minimizing a loss function.
Prediction: classify new data with the learned decision tree model.

(3) Advantages and disadvantages of the model

Advantages:
1) Readable and fast at classification: simple, easy to understand, practical and efficient. Once built, the tree can be used many times, and only simple maintenance of the tree is needed to keep its classification accuracy.
2) Handles both discrete and continuous values, while many algorithms focus on only one of the two.
3) The model can be selected by cross-validated pruning, which improves generalization.
4) Good tolerance to outliers; high robustness.
5) Requires essentially no preprocessing: no normalization in advance, and missing values can be handled.
Limitations:
1) Decision trees overfit very easily, which weakens generalization. This can be mitigated by limiting the minimum number of samples per node and the maximum depth of the tree (a scikit-learn sketch follows this list).
2) Small changes in the samples can cause dramatic changes in the tree structure. This can be addressed with ensemble learning and similar techniques.
3) Finding the optimal decision tree is an NP-hard problem, so heuristics are generally used, which easily fall into local optima. Ensemble methods can also improve this.
4) Some more complex relationships, such as XOR, are hard for a decision tree to learn. There is no real fix for this within decision trees; such relationships are usually handled by switching to other classifiers such as neural networks.
5) If the sample proportions of certain features are too large, the generated tree tends to be biased toward those features. This can be improved by adjusting the sample weights.
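As a quick illustration of point 1), here is a minimal scikit-learn sketch (using the bundled iris data, which is not specific to this post) contrasting an unconstrained tree with one whose depth and leaf size are limited:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Unconstrained tree: free to grow until every leaf is pure, so it tends to over-fit
deep = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Constrained tree: limiting depth and leaf size acts as regularization
shallow = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                 random_state=0).fit(X_train, y_train)

print(deep.score(X_test, y_test), shallow.score(X_test, y_test))

On a dataset this small the two test scores may be close; the point is only to show where the over-fitting controls (max_depth, min_samples_leaf) live.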

(4) The essence of the decision tree

From the viewpoint of a set of if-then rules, the essence of decision tree learning is to induce a set of classification rules from the training data. Many rule sets may be consistent with the data; we want one that conflicts with the training data as little as possible while still generalizing well.
From the viewpoint of conditional probability, the essence of decision tree learning is to estimate a conditional probability model from the training data. The chosen model should not only fit the training data well but also predict unseen data well.

Selecting the optimal decision tree directly is an NP-complete problem, so in practice heuristic methods are used to learn a suboptimal decision tree.

(5) Decision tree loss function and learning strategy

The loss function is usually the regularized maximum likelihood function, and the learning strategy is to minimize this loss function as the objective.

(6) Common algorithms

  • ID3
  • C4.5
  • CART
    ID3 generally selects the feature with the largest information gain to generate the decision tree.
    C4.5 generally selects the feature with the largest information gain ratio to generate the decision tree.
    CART generally selects the feature with the smallest Gini index to generate the decision tree.
    (A rough scikit-learn mapping is sketched below.)
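scikit-learn, which is used for the implementation later in this post, does not ship separate ID3/C4.5/CART estimators. A rough and only approximate correspondence (my own mapping, not an official one) is through the criterion parameter:

from sklearn.tree import DecisionTreeClassifier

# criterion='entropy' splits by information gain (ID3/C4.5 flavour),
# criterion='gini' splits by the Gini index (CART, the default).
entropy_tree = DecisionTreeClassifier(criterion='entropy')
gini_tree = DecisionTreeClassifier(criterion='gini')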

Second, ideas of the decision tree algorithm

(1) Feature selection

Feature selection picks features that have the ability to classify the training data; this improves the efficiency of decision tree learning. At the current node, one feature is selected for splitting according to some criterion; different quantitative criteria lead to different decision trees. The usual criteria are information gain (used by ID3), information gain ratio (used by C4.5), and the Gini index (used by CART).

The entropy of a random variable X is defined as

H(X) = -\sum_{i=1}^{n} p_i \log p_i,  where p_i = P(X = x_i)

The conditional entropy of a random variable Y given a random variable X is defined as

H(Y \mid X) = \sum_{i=1}^{n} p_i \, H(Y \mid X = x_i),  where p_i = P(X = x_i)

When the probabilities in the entropy and the conditional entropy are estimated from data, the resulting quantities are called the empirical entropy and the empirical conditional entropy.

Information gain is defined as

g(D, A) = H(D) - H(D \mid A)

The information gain in decision tree learning is equivalent to the mutual information between the classes and the feature in the training data set. A feature with larger information gain has stronger classification ability, so for a given data set we compute the information gain of every feature and choose the feature with the maximum gain.

Intuition: after splitting on the feature with the largest information gain, the resulting subsets have higher purity, i.e. lower uncertainty. So we always split the data set on the feature that currently gives the largest information gain.

Disadvantage: information gain is biased toward features with many values. (Reason: when a feature has many values, splitting on it more easily yields high-purity subsets, so the entropy after the split is lower, the uncertainty is smaller, and the information gain is therefore larger.)

Calculation steps:
1. Compute the empirical entropy H(D) of the data set D.
2. For each feature A, compute the empirical conditional entropy H(D|A).
3. Compute the information gain g(D, A) = H(D) - H(D|A) and select the feature with the largest value.
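These steps fit in a few lines of Python. The sketch below uses hypothetical toy data, and the helper names (entropy, conditional_entropy, info_gain) are my own:

import numpy as np

def entropy(labels):
    """Empirical entropy H(D) = -sum_k p_k * log2(p_k)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def conditional_entropy(feature, labels):
    """Empirical conditional entropy H(D|A) = sum_i |D_i|/|D| * H(D_i)."""
    values, counts = np.unique(feature, return_counts=True)
    weights = counts / counts.sum()
    return sum(w * entropy(labels[feature == v]) for w, v in zip(weights, values))

def info_gain(feature, labels):
    """Information gain g(D, A) = H(D) - H(D|A)."""
    return entropy(labels) - conditional_entropy(feature, labels)

# Toy example (hypothetical data): one binary feature, binary labels
x = np.array([0, 0, 1, 1, 1, 0])
y = np.array([0, 0, 1, 1, 0, 1])
print(info_gain(x, y))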

Information gain ratio is defined as follows.

The size of the information gain is relative to the data set, and information gain favors features with many values; the information gain ratio can be used to correct this problem:

g_R(D, A) = \frac{g(D, A)}{H_A(D)},  where  H_A(D) = -\sum_{i=1}^{n} \frac{|D_i|}{|D|} \log_2 \frac{|D_i|}{|D|}

and D_i is the subset of D on which feature A takes its i-th value.
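A self-contained sketch of the gain ratio (my own helper names, hypothetical data), with H_A(D) computed in log base 2 as above:

import numpy as np

def gain_ratio(x, y):
    """Information gain ratio g_R(D, A) = g(D, A) / H_A(D) for a discrete feature x."""
    def entropy(labels):
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    h_d = entropy(y)                                   # empirical entropy H(D)
    values, counts = np.unique(x, return_counts=True)
    w = counts / counts.sum()
    h_d_a = sum(wi * entropy(y[x == v]) for wi, v in zip(w, values))   # H(D|A)
    h_a_d = -np.sum(w * np.log2(w))                    # intrinsic value H_A(D)
    return (h_d - h_d_a) / h_a_d if h_a_d > 0 else 0.0

x = np.array([0, 0, 1, 1, 2, 2])
y = np.array([0, 0, 1, 1, 1, 0])
print(gain_ratio(x, y))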

(2) Decision tree generation

ID3 algorithm (feature selection by information gain)

(figure: ID3 algorithm steps)
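For intuition, here is a minimal recursive ID3 sketch of my own (not the pseudocode from the figure), assuming purely discrete features and representing the tree as nested dicts keyed by feature index and feature value:

import numpy as np

def entropy(y):
    _, c = np.unique(y, return_counts=True)
    p = c / c.sum()
    return -np.sum(p * np.log2(p))

def info_gain(x, y):
    vals, c = np.unique(x, return_counts=True)
    w = c / c.sum()
    return entropy(y) - sum(wi * entropy(y[x == v]) for wi, v in zip(w, vals))

def id3(X, y, features, eps=1e-3):
    """Recursively build an ID3 tree; leaves are class labels."""
    classes, counts = np.unique(y, return_counts=True)
    majority = classes[np.argmax(counts)]
    if len(classes) == 1 or not features:          # pure node, or no feature left
        return majority
    gains = {f: info_gain(X[:, f], y) for f in features}
    best = max(gains, key=gains.get)
    if gains[best] < eps:                          # best gain below threshold: make a leaf
        return majority
    tree = {best: {}}
    for v in np.unique(X[:, best]):
        mask = X[:, best] == v
        tree[best][v] = id3(X[mask], y[mask], [f for f in features if f != best], eps)
    return tree

# Toy usage with two discrete features (hypothetical data)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 1, 1])
print(id3(X, y, features=[0, 1]))   # the tree splits on feature 0 only; both leaves are pure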

ID3 algorithm flaws:

a) ID3 does not consider continuous features; values such as length or density are continuous and cannot be used in ID3. This greatly limits its use.

b) ID3 prefers the feature with the largest information gain when creating a node. It was soon noticed that, under otherwise equal conditions, features with more values have larger information gain than features with fewer values. For example, one variable has two values, each with probability 1/2, and another variable has three values, each with probability 1/3; both are completely uncertain variables, yet the three-valued one has the larger information gain. How can this be corrected?

c) ID3 does not consider the case of missing values.

d) ID3 does not consider the problem of over-fitting.

C4.5 algorithm (feature selection by information gain ratio)

(figure: C4.5 algorithm steps)

Improvements of C4.5 relative to ID3:

For the first problem of ID3 (continuous features cannot be handled), the approach of C4.5 and CART is to discretize the continuous feature (a threshold-search sketch follows this list).
For the second problem (information gain as a criterion is biased toward features with many values), C4.5 introduces the information gain ratio, which corrects this bias.
For the fourth problem, C4.5 introduces a regularization coefficient for preliminary pruning.
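As an illustration of the discretization idea (a simplified sketch, not C4.5's exact procedure), a common approach is to sort the continuous feature and evaluate the midpoints between consecutive distinct values as candidate thresholds, keeping the one with the largest information gain:

import numpy as np

def entropy(y):
    _, c = np.unique(y, return_counts=True)
    p = c / c.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(x, y):
    """Scan midpoints of a sorted continuous feature; return the split with the largest gain."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    h_d = entropy(y)
    best_gain, best_t = -1.0, None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        t = (x[i] + x[i - 1]) / 2                     # candidate threshold: midpoint
        left, right = y[:i], y[i:]
        h_cond = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if h_d - h_cond > best_gain:
            best_gain, best_t = h_d - h_cond, t
    return best_t, best_gain

# Hypothetical example: a continuous feature (e.g. density) and binary labels
x = np.array([0.2, 0.4, 0.5, 0.7, 0.9])
y = np.array([0, 0, 1, 1, 1])
print(best_threshold(x, y))   # threshold 0.45 separates the two classes perfectly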

Deficiencies of C4.5:

1) Decision tree algorithms overfit very easily, so the generated tree must be pruned. There are many pruning algorithms; C4.5's pruning follows two main ideas: pre-pruning, i.e. deciding whether to prune while the tree is being generated, and post-pruning, i.e. generating the tree first and then pruning it through cross-validation.

2) C4.5 generates a multiway tree, i.e. a parent node can have more than two children, whereas CART generates a binary tree. In many cases a binary tree model is computationally more efficient than a multiway tree, so adopting a binary tree can improve efficiency.

3) C4.5 can only be used for classification; if decision trees could also be used for regression, their range of application would be wider.

4) Because C4.5 uses the entropy model, it involves a large number of time-consuming logarithm operations, and for continuous values a large number of sorting operations as well. It would be even better if the model could be simplified to reduce the computational cost without sacrificing too much accuracy.

(3) Decision tree pruning

The recursive tree-building process of classification and regression trees essentially suffers from over-fitting: the model becomes too complex. During construction, because of noise or isolated points in the training data, many branches reflect anomalies of the training data; using such a tree to classify data of unknown class gives low accuracy. We therefore try to detect and cut off such branches; this process is called tree pruning. Decision tree pruning is usually achieved by minimizing the overall loss function of the tree.

Tree generation only considers fitting the data set better by increasing the information gain (or information gain ratio), whereas pruning reduces model complexity by optimizing the loss function. Pruning by the principle of minimizing the loss function amounts to model selection via regularized maximum likelihood estimation:

C_\alpha(T) = C(T) + \alpha |T|

C(T) is the prediction error of the model on the training data, |T| is the number of leaf nodes (the model complexity), and the parameter \alpha controls the trade-off between the two; a smaller \alpha selects a more complex model.

Third, the CART decision tree algorithm (feature selection by Gini index)

The CART algorithm can do both regression and classification.
The difference lies in the sample output: if the sample output is a discrete value, it is a classification tree; if the sample output is a continuous value, it is a regression tree. The main differences between building and predicting with a CART regression tree versus a CART classification tree are the following two points:
1) Continuous values are handled differently.
   A CART classification tree measures the quality of each candidate split point of a feature by the Gini index.
   A CART regression tree uses the usual sum-of-squared-errors measure (a split-search sketch follows this list).
2) After the tree is built, predictions are made differently: a classification tree outputs the majority class of the leaf, while a regression tree outputs the mean of the samples in the leaf.
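A sketch of the regression-tree criterion from point 1), simplified to a single feature with hypothetical data: the best split point minimizes the summed squared error of the two halves, with each half predicting its own mean:

import numpy as np

def best_regression_split(x, y):
    """Choose the threshold that minimizes the total squared error of the two halves."""
    order = np.argsort(x)
    x, y = x[order], y[order]
    best_err, best_t = np.inf, None
    for i in range(1, len(x)):
        if x[i] == x[i - 1]:
            continue
        left, right = y[:i], y[i:]
        # each side predicts its mean; the score is the summed squared error
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best_err:
            best_err, best_t = err, (x[i] + x[i - 1]) / 2
    return best_t, best_err

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.0, 2.9, 3.1, 3.0])
print(best_regression_split(x, y))   # the best split is near x = 2.5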

(1) Basic steps of the algorithm:

(1) Decision tree generation: select features based on the Gini index of each feature of the training data and generate the decision tree; the generated tree should be as large as possible.
(2) Decision tree pruning: prune the tree with a validation data set and select the optimal subtree, using minimum loss function as the criterion.

(2) Definition of the Gini index

Gini(p) = \sum_{k=1}^{K} p_k (1 - p_k) = 1 - \sum_{k=1}^{K} p_k^2

The larger the Gini index, the greater the uncertainty of the set. Given that feature A splits the set D into D_1 and D_2, the Gini index of D under the condition of feature A is defined as:

Gini(D, A) = \frac{|D_1|}{|D|} Gini(D_1) + \frac{|D_2|}{|D|} Gini(D_2)

When selecting a feature, compute the Gini index of each feature on the data set and choose the feature with the smallest Gini index.
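A small sketch of these two formulas (my own helper names, toy data). For a discrete feature, CART considers binary splits of the form "A = value" versus "A != value" and keeps the split with the smallest Gini(D, A):

import numpy as np

def gini(y):
    """Gini index Gini(D) = 1 - sum_k p_k^2."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_index(x, y, value):
    """Gini(D, A): weighted Gini after the binary split x == value vs x != value."""
    mask = (x == value)
    n = len(y)
    return mask.sum() / n * gini(y[mask]) + (~mask).sum() / n * gini(y[~mask])

x = np.array(['a', 'a', 'b', 'b', 'c'])
y = np.array([0, 0, 1, 1, 1])
print(min((gini_index(x, y, v), v) for v in np.unique(x)))   # best binary split point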

(3) CART decision tree pruning

Two steps:
1. Pruning: prune the tree step by step to form a sequence of subtrees.
2. From the subtree sequence obtained by pruning, use cross-validation on an independent validation data set to select the subtree with the smallest squared error or Gini index, and return it as the optimal subtree.

(figure: CART pruning algorithm)
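scikit-learn exposes this idea as cost-complexity pruning: it first computes a sequence of subtrees indexed by an effective alpha, and the best one is then chosen by validation. A sketch, assuming scikit-learn >= 0.22 (where ccp_alpha and cost_complexity_pruning_path are available) and using cross-validation in place of a separate validation set:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Step 1: compute the pruning path -- one effective alpha per subtree in the sequence
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

# Step 2: pick the alpha (i.e. the subtree) with the best cross-validated score
scores = [cross_val_score(DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5).mean()
          for a in path.ccp_alphas]
best_alpha = path.ccp_alphas[int(np.argmax(scores))]
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=best_alpha).fit(X, y)
print(best_alpha, pruned.get_n_leaves())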

Fourth, decision tree code implementation

Two implementations are covered:
sklearn (calling the library)
python3 (implementation following the Machine Learning in Action source code)
GitHub code

sklearn implementation

sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)

Parameters of sklearn.tree.DecisionTreeClassifier

Most parameters are the same as for the regression tree, except the class weight parameter class_weight.

  • Split criterion criterion: "gini" or "entropy"; the former stands for the Gini index, the latter for information gain. The default is "gini", i.e. the CART algorithm.
  • Split point strategy splitter: "best" or "random". The former searches all split points of each feature for the optimal one; the latter randomly examines part of the split points and takes the locally optimal one. The default "best" is suitable when the sample size is not large; if the sample size is very large, "random" is recommended when building the tree.
  • Maximum number of features considered per split max_features:
    If int, max_features features are considered at each split.
    If float, int(max_features * n_features) features are considered at each split.
    If "auto", max_features = sqrt(n_features).
    If "sqrt", max_features = sqrt(n_features).
    If "log2", max_features = log2(n_features).
    If None, max_features = n_features, i.e. all features are considered.
  • Maximum tree depth max_depth: once the tree reaches max_depth, it stops growing no matter how many features could still be split on.
  • Minimum number of samples required to split an internal node min_samples_split: this value limits whether a subtree may continue to split; a node with fewer samples than min_samples_split will not try to select an optimal feature to split on. The default is 2; with a large sample size this value can be raised.
  • Minimum number of samples per leaf min_samples_leaf: if a leaf would contain fewer samples than this value, it is pruned together with its sibling nodes.
  • Minimum weighted fraction of samples in a leaf min_weight_fraction_leaf: sample weights are introduced when many samples have missing values, or when the class distribution of the classification samples is heavily skewed.
  • Maximum number of leaf nodes max_leaf_nodes: limiting the maximum number of leaves can prevent over-fitting; the default is None.
  • Class weights class_weight: specifies a weight for each class, mainly to prevent classes with too many training samples from biasing the trained tree toward those classes. It can be set to "balanced", in which case the algorithm computes the weights itself, giving higher weight to classes with fewer samples, or the weights can be set manually. The default is None. Not applicable to regression trees.
  • Minimum impurity for splitting min_impurity_split: this value limits the growth of the tree; if a node's impurity (Gini index, information gain, mean squared error, mean absolute deviation) is below this threshold, the node is not split further and becomes a leaf.

Key points for tuning decision trees:

1. If the number of samples is small but the number of features is very large, it is recommended to reduce the dimensionality before fitting the decision tree model, e.g. with principal component analysis (PCA), feature selection (Lasso), or independent component analysis (ICA).
2. Make frequent use of decision tree visualization (covered in the next section), and limit the depth of the tree first.
3. Use min_samples_split or min_samples_leaf to control the number of samples in each leaf. A very small number usually means the tree will overfit, while a number that is too large prevents the tree from learning from the data. Try min_samples_leaf=5 as an initial value. If the sample sizes vary drastically, a float (percentage) can be used for these two parameters. The main difference between them is that min_samples_leaf guarantees a minimum number of samples in every leaf, whereas min_samples_split can create arbitrarily small nodes to hold samples.
4. If the input matrix X is very sparse, it is recommended to convert it with scipy.sparse.csc_matrix before fitting and with scipy.sparse.csr_matrix before prediction (csc_matrix compresses by column, csr_matrix by row); see the sketch below. Compared with a dense matrix, training on a sparse matrix whose features contain many zero values can be orders of magnitude faster.
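A minimal sketch of tip 4, using random, purely hypothetical data just to show the call pattern:

import numpy as np
from scipy import sparse
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X_dense = rng.binomial(1, 0.01, size=(1000, 500)).astype(np.float32)   # mostly zeros
y = rng.randint(0, 2, size=1000)

clf = DecisionTreeClassifier(min_samples_leaf=5, random_state=0)
clf.fit(sparse.csc_matrix(X_dense), y)            # column-compressed input for fitting
pred = clf.predict(sparse.csr_matrix(X_dense))    # row-compressed input for prediction
print(pred[:10])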

Decision tree visualization:

First, set up the environment.
Step 1: install graphviz. Download it at http://www.graphviz.org/.
On Linux it can be installed with apt-get or yum. On Windows, download the msi installer from the official website. On either Linux or Windows, set the environment variable after installation: add the graphviz bin directory to PATH.
Step 2: install the python package graphviz: pip install graphviz
Step 3: install the python package pydotplus (nothing special here): pip install pydotplus
At this point the environment is ready. If graphviz still cannot be found, add this line in your code: os.environ["PATH"] += os.pathsep + 'C:/Program Files (x86)/Graphviz2.38/bin/'
Note that the path at the end must be the bin directory of your own graphviz installation.

Code:

from sklearn import tree
from sklearn import datasets, model_selection
from sklearn.metrics import accuracy_score
import pydotplus
import os

# Make sure the graphviz binaries are on PATH (adjust to your own install directory)
os.environ['PATH'] += os.pathsep + 'D:\\Graphviz2.38\\bin\\'

iris = datasets.load_iris()   # the iris dataset that ships with scikit-learn
X = iris.data
y = iris.target
X_train, X_test, y_train, y_test = model_selection.train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit a decision tree that splits by information gain (criterion='entropy')
clf = tree.DecisionTreeClassifier(criterion='entropy')
clf.fit(X_train, y_train)
print('test accuracy:', accuracy_score(y_test, clf.predict(X_test)))

# Export the tree in Graphviz .dot format
with open("iris.dot", 'w') as f:
    tree.export_graphviz(clf, out_file=f)

# Render the tree to a PDF via pydotplus
dot_data = tree.export_graphviz(clf, out_file=None,
                                feature_names=iris.feature_names,
                                class_names=iris.target_names,
                                filled=True, rounded=True,
                                special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data)
graph.write_pdf("C:\\Users\\Administrator\\Desktop\\iris.pdf")   # write the PDF file

The result is shown below:
(figure: the rendered iris decision tree)


Origin blog.csdn.net/qq_39751437/article/details/86550287