1. What is a decision tree
A decision tree is an algorithm for solving classification problems.
It uses a tree structure and reaches the final classification through successive layers of reasoning.
A decision tree is made up of the following elements:
- Root node: contains the complete set of samples
- Internal nodes: each corresponds to a test on a feature attribute
- Leaf nodes: represent the decision results
To make a prediction, each internal node tests the value of one attribute and the result determines which branch to follow. This repeats until a leaf node is reached, which gives the classification result.
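The prediction walk described above can be sketched in a few lines of Python. The tree here is written by hand purely for illustration: the dictionary layout, the feature names, and the thresholds are assumptions, not values taken from a trained model.

```python
# Hypothetical hand-written tree: internal nodes test one feature against a
# threshold, leaf nodes carry a class label. Thresholds are illustrative only.
toy_tree = {
    "feature": "petal_length", "threshold": 2.5,
    "left": {"label": "setosa"},
    "right": {
        "feature": "petal_width", "threshold": 1.7,
        "left": {"label": "versicolor"},
        "right": {"label": "virginica"},
    },
}

def predict(node, sample):
    """Walk from the root to a leaf and return the leaf's label."""
    while "label" not in node:
        # Test the attribute at this internal node and choose a branch.
        if sample[node["feature"]] <= node["threshold"]:
            node = node["left"]
        else:
            node = node["right"]
    return node["label"]

print(predict(toy_tree, {"petal_length": 1.4, "petal_width": 0.2}))  # setosa
print(predict(toy_tree, {"petal_length": 5.1, "petal_width": 2.0}))  # virginica
```

Each call follows exactly one path from the root to a leaf, so the cost of a prediction is proportional to the depth of the tree.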
This is a supervised learning algorithm: the rules of the decision tree are learned through training rather than written by hand.
Decision trees are among the simplest machine learning algorithms. They are easy to implement, highly interpretable, intuitive and close to the way people reason, and they have a wide range of applications.
2. How a decision tree works
Constructing a decision tree:
A key step in constructing a decision tree is splitting on attributes: at a node, the samples are divided into different branches according to the values of one feature attribute. The goal is to make each resulting subset as "pure" as possible, meaning that the items in a subset belong, as far as possible, to the same class.
The key element in constructing a decision tree is the attribute selection measure. It is the criterion for choosing a split, and it determines the topology of the tree and the choice of the split point (split_point).
There are many algorithms built on attribute selection measures. They generally work top-down by recursive divide and conquer, and follow a greedy strategy with no backtracking.
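As a rough illustration of the construction described above, the following sketch builds a tree top-down, greedily, and without backtracking, using information gain (the reduction in entropy) as the attribute selection measure. The function names, the dictionary layout, and the six-sample toy data set are all assumptions made for this example; it is not how scikit-learn implements its trees.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return sum(c / total * math.log2(total / c) for c in Counter(labels).values())

def best_split(rows, labels):
    """Return (gain, feature_index, threshold) with the largest information
    gain, or None when no split separates the samples."""
    parent = entropy(labels)
    best = None
    for f in range(len(rows[0])):
        for t in sorted({r[f] for r in rows}):
            left = [y for r, y in zip(rows, labels) if r[f] <= t]
            right = [y for r, y in zip(rows, labels) if r[f] > t]
            if not left or not right:
                continue
            weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
            gain = parent - weighted
            if best is None or gain > best[0]:
                best = (gain, f, t)
    return best

def build_tree(rows, labels):
    """Greedy recursive construction: pick the best split, recurse, never revisit."""
    split = best_split(rows, labels)
    if entropy(labels) == 0 or split is None:
        # Pure node (or nothing left to split on): make a leaf.
        return {"label": Counter(labels).most_common(1)[0][0]}
    _, f, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[f] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[f] > t]
    return {
        "feature": f, "threshold": t,
        "left": build_tree([r for r, _ in left], [y for _, y in left]),
        "right": build_tree([r for r, _ in right], [y for _, y in right]),
    }

# Hypothetical toy data: (petal length, petal width) and a class label.
rows = [(1.4, 0.2), (1.3, 0.2), (4.7, 1.4), (4.5, 1.5), (6.0, 2.5), (5.1, 1.9)]
labels = ["a", "a", "b", "b", "c", "c"]
print(build_tree(rows, labels))
```

The greedy choice is visible in `build_tree`: once a split is chosen for a node it is never reconsidered, even if a different split would have produced a smaller tree overall.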
3. Decision trees in practice: the iris classification problem
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn import datasets
    import matplotlib.pyplot as plt
    # Jupyter notebook magic; remove this line when running as a plain script
    %matplotlib inline
    from sklearn import tree
    from sklearn.model_selection import train_test_split

    # Load the iris data
    iris = datasets.load_iris()
    X = iris['data']
    y = iris['target']
    # Names of the four iris features
    feature_names = iris.feature_names
    # Split the data into training and test sets at a ratio of 4:1
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1024)

    # Train on the training data, then predict the iris species
    # criterion='entropy' uses entropy as the splitting criterion; 'gini' can be used instead.
    # max_depth sets the depth of the tree; by default the tree grows to its maximum depth.
    # A deeper tree fits the training data more closely, but it may overfit.
    clf = DecisionTreeClassifier(criterion='entropy')
    clf.fit(X_train, y_train)
    y_ = clf.predict(X_test)
    from sklearn.metrics import accuracy_score
    # Compute the accuracy on the test set
    accuracy_score(y_test, y_)

    # View the tree structure
    plt.figure(figsize=(18, 12))
    # filled=True fills the nodes with color
    _ = tree.plot_tree(clf, filled=True, feature_names=feature_names)
    plt.savefig('./tree.jpg')
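Now let's analyze this tree:
(1) The tree is split according to entropy. So what is entropy, and what is its formula?
Entropy is a concept from information theory and from probability and statistics, but it is also used a great deal in machine learning. Information entropy is a measure of the uncertainty of a random variable.
What does the amount of uncertainty depend on?
First, on the number of possible outcomes; second, on their probabilities.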
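So the formula for entropy, where $p_i$ is the probability of the $i$-th of $n$ possible outcomes, is $H(X) = -\sum_{i=1}^{n} p_i \log_2 p_i$, or equivalently $H(X) = \sum_{i=1}^{n} p_i \log_2 \frac{1}{p_i}$.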
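Claude Shannon, the father of information theory, summarized three properties of information entropy:
- Monotonicity: the more probable an event is, the less information it carries. The extreme case is "the sun rises in the east": because it is a certain event, it carries no information at all. From the point of view of information theory, the sentence removes no uncertainty.
- Non-negativity: information entropy cannot be negative. This is easy to understand, because negative information would mean that learning something increases your uncertainty, which makes no sense.
- Additivity: the total uncertainty of several random events occurring together can be expressed as the sum of the uncertainties of the individual events.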
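The three properties can be checked numerically. The sketch below uses the standard definitions of self-information and Shannon entropy with base-2 logarithms. The coin and die distributions are made up for the example, and the additivity check assumes the two variables are independent.

```python
import math

def self_information(p):
    """Information (in bits) carried by an event of probability p."""
    return math.log2(1 / p)

def entropy(probs):
    """Shannon entropy (in bits) of a discrete distribution."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Monotonicity: the more likely the event, the less information it carries.
# A certain event (p = 1) carries none.
print(self_information(0.5), self_information(0.9), self_information(1.0))

# Non-negativity: entropy is never below zero.
print(entropy([0.5, 0.5]) >= 0, entropy([1.0]) >= 0)

# Additivity (assuming independence): the entropy of the joint distribution
# of a fair coin and a fair die equals the sum of their entropies.
coin = [0.5, 0.5]
die = [1 / 6] * 6
joint = [p * q for p in coin for q in die]
print(math.isclose(entropy(joint), entropy(coin) + entropy(die)))  # True
```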
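Computing the entropy of the root node in the example above (samples is the number of samples, value is the number of flowers in each class, and entropy is the entropy value):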
39/120*np.log2(120/39)*2+42/120*np.log2(120/42)=1.584
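The entropy of every subsequent node is computed with the same formula. The split at the first node separates out the first class of flowers directly.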
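The root-node figure can be reproduced with a small helper that takes the class counts of a node. The counts 39, 39 and 42 come from the calculation above; the helper name and the pure-node example are assumptions added for illustration.

```python
import math

def entropy_from_counts(counts):
    """Entropy (in bits) of a node, given the number of samples per class."""
    total = sum(counts)
    return sum(c / total * math.log2(total / c) for c in counts if c > 0)

# Root node: 120 training samples split 39 / 39 / 42 across the three classes.
print(round(entropy_from_counts([39, 39, 42]), 3))  # 1.584

# Hypothetical pure node: every sample belongs to one class, so entropy is 0.
print(entropy_from_counts([39, 0, 0]))  # 0.0
```

A node with entropy 0 is pure and does not need to be split further.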
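(2) The first split condition divides the samples according to an attribute.
Irises have four attributes. The first split can be guided by the variance of the training samples: the larger the variance, the more spread out the values are and the easier the samples are to separate. After that, the four attributes are evaluated with various methods. With criterion='entropy', each candidate split is scored by how much it reduces entropy. In short, the splitting is fairly involved, but it is not arbitrary: it is based on the four attributes of the flower (sepal length and width, petal length and width). In the end we obtain three pure groups of irises.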
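(3) The entropy above can also be replaced with the Gini coefficient. The approach is the same. With $p_i$ the proportion of class $i$ in a node, the formula is $Gini = 1 - \sum_{i=1}^{n} p_i^2$.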
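For comparison with the entropy calculation, here is a minimal sketch of the Gini coefficient for a node, assuming the standard Gini impurity definition (one minus the sum of squared class proportions). The helper name is hypothetical; the class counts are those of the root node above.

```python
def gini_from_counts(counts):
    """Gini impurity of a node, given the number of samples per class."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

# Root node of the iris tree: 39 / 39 / 42 samples per class.
print(round(gini_from_counts([39, 39, 42]), 3))  # 0.666

# A pure node has Gini impurity 0, just as it has entropy 0.
print(gini_from_counts([39, 0, 0]))  # 0.0
```

Like entropy, the value is largest when the classes are evenly mixed and drops to 0 for a pure node, which is why the two criteria usually lead to similar trees.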