Decision Trees: Machine Learning Principles and Practice with sklearn

1. Scene Description

Time: eight in the morning. Location: a matchmaking session.

'Daughter, I've found you a suitable match. Would you like to meet him today?'

'How old is he?' 'Twenty-six.'

'Is he handsome?' 'He's decent looking.'

'Is his salary high?' 'Slightly above average.'

'Does he write code?' 'He's a programmer, and his code is excellent!'

'All right, send me his contact information and I'll find time to meet him.'

The scenario above, excerpted from <One Hundred Faces of Machine Learning> (百面机器学习), is a typical decision tree classification problem: deciding whether to meet a blind date based on features such as age, looks, salary, and whether he writes code.

A decision tree is a tree structure that classifies sample data in a top-down process. It is composed of nodes and edges: each internal node (every node except a leaf) corresponds to a feature or attribute, and each leaf node represents a class. Starting from the root node at the top, all samples are gathered together; after the split at the root node, the samples are assigned to different child nodes. The child nodes are then further divided according to features, until every sample has been assigned to a class (a leaf node).

2. Decision Tree Principles

As one of the most basic and most common supervised learning models, decision trees are often used for classification and regression problems. Applying ensemble ideas to decision trees yields random forests and gradient-boosted decision trees. Their main advantages are that the model is readable and classification is fast. Decision tree learning usually involves three steps: feature selection, decision tree generation, and decision tree pruning. The following describes how the different algorithms differ in feature selection.

2.1 ID3 --- maximum information gain

In information theory and probability statistics, entropy is a measure of the uncertainty of a random variable. Let X be a random variable taking finitely many values, with probability distribution \[P(X=x_i)=p_i, \quad i=1,2,\dots,n\] The entropy of the random variable X is then defined as \[H(X)=-\sum_{i=1}^n p_i\log p_i\] If the logarithm is taken base 2 or base e, the unit of entropy is called the bit or the nat, respectively. The expression shows that the entropy depends only on the distribution of X, not on the values X takes, so the entropy of X can also be written as \(H(p)\), i.e. \[H(p)=-\sum_{i=1}^n p_i\log p_i\] The larger the entropy, the greater the uncertainty of the random variable.

Conditional entropy:

The conditional entropy H(Y|X) represents the uncertainty of the random variable Y when the random variable X is known. It is defined as the mathematical expectation, taken over X, of the entropy of the conditional distribution of Y given X: \[H(Y|X)=\sum_{i=1}^n P(X=x_i)H(Y|X=x_i)\]

Information gain: \[g(D,A) = H(D) - H(D|A)\]

import pandas as pd
data = {
        '年龄':['老','年轻','年轻','年轻','年轻'],
        '长相':['帅','一般','丑','一般','一般'],
        '工资':['高','中等','高','高','低'],
        '写代码':['不会','会','不会','会','不会'],
        '类别':['不见','见','不见','见','不见']}
frame = pd.DataFrame(data,index=['小A','小B','小C','小D','小L'])
print(frame)
    年龄  长相  工资 写代码  类别
小A   老   帅   高  不会  不见
小B  年轻  一般  中等   会   见
小C  年轻   丑   高  不会  不见
小D  年轻  一般   高   会   见
小L  年轻  一般   低  不会  不见
import math
print(math.log(3/5))   # natural log of 3/5, just for comparison with the base-2 values below
print('H(D):', -3/5*math.log(3/5,2) - 2/5*math.log(2/5,2))   # empirical entropy of 类别 (2 见, 3 不见)
print('H(D|年龄)', 1/5*math.log(1,2) + 4/5*(-1/2*math.log(1/2,2) - 1/2*math.log(1/2,2)))
# "compute H(D|长相), H(D|工资), H(D|写代码) the same way":
print('以同样的方法计算H(D|长相),H(D|工资),H(D|写代码)')
print('H(D|长相)', 0.551)
print('H(D|工资)', 0.551)
print('H(D|写代码)', 0)
-0.5108256237659907
H(D): 0.9709505944546686
H(D|年龄) 0.8
以同样的方法计算H(D|长相),H(D|工资),H(D|写代码)
H(D|长相) 0.551
H(D|工资) 0.551
H(D|写代码) 0

Computing the information gains, g(D, 写代码) = 0.971 - 0 = 0.971 is the largest, so the decision tree is split on the 写代码 (writes code) feature.
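
To double-check these hand calculations, here is a minimal sketch that computes the information gain of every feature from the frame DataFrame defined above (the helper names entropy and info_gain are purely illustrative, not part of any library):

import math

def entropy(series):
    # empirical entropy (base 2) of a pandas Series of class labels
    probs = series.value_counts(normalize=True)
    return -sum(p * math.log(p, 2) for p in probs)

def info_gain(df, feature, target='类别'):
    # information gain g(D, A) = H(D) - H(D|A)
    h_d = entropy(df[target])
    h_d_a = sum(len(sub) / len(df) * entropy(sub[target])
                for _, sub in df.groupby(feature))
    return h_d - h_d_a

for col in ['年龄', '长相', '工资', '写代码']:
    print(col, round(info_gain(frame, col), 3))   # 写代码 has the largest gain, 0.971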

2.2 C4.5 --- maximum information gain ratio

Using information gain as the criterion for splitting the training set is biased toward features with many distinct values. The information gain ratio corrects this problem. It is defined as the ratio of the information gain g(D, A) to the entropy \(H_A(D)\) of the training set D with respect to the values of feature A: \[g_R(D,A) = \frac{g(D,A)}{H_A(D)}\]

\[H_A(D) = -\sum_{i=1}^n\frac{|D_i|}{|D|}\log\frac{|D_i|}{|D|}\]

Taking the ID3 example above:
\[H_{年龄}(D) = -\frac{1}{5}\log_2\frac{1}{5} - \frac{4}{5}\log_2\frac{4}{5} = 0.722\]

\[g_R(D, 年龄) = \frac{g(D, 年龄)}{H_{年龄}(D)} = \frac{0.171}{0.722} \approx 0.237\]
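
Continuing the sketch above (and reusing the illustrative entropy and info_gain helpers), the gain ratio divides the information gain by the entropy of D with respect to the feature's own values:

def gain_ratio(df, feature, target='类别'):
    # information gain ratio g_R(D, A) = g(D, A) / H_A(D)
    h_a = entropy(df[feature])               # H_A(D): entropy over the feature's values
    return info_gain(df, feature, target) / h_a

print(round(gain_ratio(frame, '年龄'), 3))   # 0.171 / 0.722 ≈ 0.237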

2.3 CART --- Gini index (Gini)

The Gini index describes the purity of the data, with a meaning similar to that of information entropy. For classification, suppose there are K classes and the probability that a sample point belongs to class k is \(p_k\); then the Gini index of the probability distribution is defined as
\[Gini(p) = \sum_{k=1}^K p_k(1-p_k) = 1 - \sum_{k=1}^K p_k^2\] For a binary classification problem, if the probability that a sample point belongs to the first class is p, the Gini index of the distribution is \[Gini(p) = 2p(1-p)\] For a given sample set D, its Gini index is \[Gini(D) = 1 - \sum_{k=1}^K\left[\frac{|C_k|}{|D|}\right]^2\] Note that here \(C_k\) is the subset of samples of D belonging to class k, and K is the number of classes. If the sample set D is split into two parts D1 and D2 according to whether feature A takes the value a, then under the condition of feature A the Gini index of D is defined as \[Gini(D,A) = \frac{|D_1|}{|D|}Gini(D_1) + \frac{|D_2|}{|D|}Gini(D_2)\]
For the example above: \[Gini(D, 年龄=老) = \frac{1}{5}\times(1-1) + \frac{4}{5}\times\left[1-\left(\frac{1}{2}\times\frac{1}{2} + \frac{1}{2}\times\frac{1}{2}\right)\right] = 0.4\]

At each iteration, CART classification selects the feature and cut point with the smallest Gini index for the split.
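
The same example can be checked with a small sketch of the Gini calculation (again using the frame DataFrame from above; gini and gini_split are illustrative helper names):

def gini(series):
    # Gini index of a pandas Series of class labels: 1 - sum(p_k^2)
    probs = series.value_counts(normalize=True)
    return 1 - sum(p ** 2 for p in probs)

def gini_split(df, feature, value, target='类别'):
    # Gini(D, A): weighted Gini of the two parts split on feature == value
    d1 = df[df[feature] == value]
    d2 = df[df[feature] != value]
    return len(d1) / len(df) * gini(d1[target]) + len(d2) / len(df) * gini(d2[target])

print(round(gini_split(frame, '年龄', '老'), 3))   # 0.4, matching the calculation above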

2.4 Differences between ID3, C4.5, and CART

2.4.1 From the perspective of sample type

From the perspective of sample type, ID3 can only handle discrete variables, while C4.5 and CART can both handle continuous variables. When handling a continuous variable, C4.5 sorts the data, finds the cut points where the class changes, and converts the comparison against each cut point into a Boolean value, thereby discretizing the continuous variable into multiple intervals. CART performs a binary split on a feature at every step of its construction, so it is naturally well suited to continuous variables.
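
As an illustration of the cut-point idea, here is a simplified sketch (not C4.5's exact procedure: every midpoint between sorted values is tried, and plain information gain rather than the gain ratio is used as the score; entropy_np and best_cut_point are illustrative names):

import numpy as np

def entropy_np(y):
    # empirical entropy (base 2) of a label array
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_cut_point(x, y):
    # choose the threshold on continuous feature x that maximizes information gain
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    candidates = (x_sorted[:-1] + x_sorted[1:]) / 2      # midpoints between neighbours
    h_d, n = entropy_np(y), len(y)
    best_gain, best_t = -1.0, None
    for t in np.unique(candidates):
        left, right = y_sorted[x_sorted <= t], y_sorted[x_sorted > t]
        gain = h_d - (len(left) / n * entropy_np(left) + len(right) / n * entropy_np(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain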

2.4.2 From the perspective of application

ID3 and C4.5 are only applicable to classification tasks, while CART can be used for both classification and regression.

2.4.3 From the perspective of implementation details and optimization

ID3 is sensitive to missing feature values, whereas C4.5 and CART can handle missing values in different ways. ID3 and C4.5 can produce multi-way branches at each node, and a feature is not reused at different levels of the tree, whereas CART produces exactly two branches at each node, forming a binary tree, and each feature can be reused. ID3 and C4.5 trade off accuracy against generalization ability by pruning the tree, whereas CART directly uses all of the data to find all possible tree structures for comparison.

3. Decision Tree Pruning

3.1 Why prune?

Trees are pruned in order to prevent overfitting.

A decision tree generated from the training set by a decision tree algorithm is often complex, which leads to overfitting on the test data. To address overfitting, the complexity of the decision tree must be taken into account: pruning cuts off some branches to improve the generalization ability of the model.

There are generally two pruning methods: pre-pruning and post-pruning.

3.2 Pre-pruning

The core idea of pre-pruning is that, before a node in the tree is expanded, we first determine whether the current split would improve the generalization ability of the model; if not, the subtree is not grown any further. At that point samples of different classes may coexist in the node, and the node's class is decided by majority vote. There are several pre-pruning criteria for deciding when to stop growing the tree:

  • (1) Stop growing the tree when it reaches a certain depth
  • (2) Stop growing the tree when the number of leaf nodes reaches a certain threshold
  • (3) Stop growing the tree when the number of samples reaching a node falls below a certain threshold
  • (4) Compute the improvement in test-set accuracy for each split, and stop expanding when the improvement falls below a certain threshold

Pre-pruning is direct in idea, simple to implement, and efficient, so it is suitable for large-scale problems. However, accurately estimating when to stop growing the tree varies greatly from problem to problem and requires some experience to judge. Pre-pruning therefore has limitations and carries a risk of underfitting.
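
In scikit-learn, these stopping criteria correspond roughly to constructor parameters of DecisionTreeClassifier; the sketch below shows the mapping (the values are arbitrary examples, not recommendations, and min_impurity_decrease is impurity-based rather than the accuracy-based criterion (4), although it plays a similar role):

from sklearn.tree import DecisionTreeClassifier

pre_pruned_tree = DecisionTreeClassifier(
    max_depth=5,                 # (1) stop when the tree reaches a given depth
    max_leaf_nodes=20,           # (2) cap the number of leaf nodes
    min_samples_split=10,        # (3) do not split nodes with fewer samples than this
    min_impurity_decrease=0.01,  # (4) analogue: only split if impurity drops by at least this much
)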

3.3 Post-pruning

The core idea of post-pruning is to let the algorithm generate a fully grown tree first and then determine, from the bottom up, whether to prune. Pruning deletes a subtree and replaces it with a leaf node, whose class is again decided by majority vote. Likewise, whether to prune can be decided by test-set accuracy: if accuracy improves after pruning, the pruning is performed. Post-pruning generally yields a tree with stronger generalization ability than pre-pruning, but at a greater cost in time.

Loss function

\[C_a(T) = \sum_{t=1}^{|T|}N_tH_t(T) + a|T|\]

where \(|T|\) is the number of leaf nodes, \(N_t\) is the number of samples at leaf node \(t\), \(H_t(T)\) is the entropy of node \(t\), \(a|T|\) is the penalty term, and \(a \ge 0\).

\[C_a(T) = \sum_{t=1}^{|T|}N_tH_t(T) + a|T| = -\sum_{t=1}^{|T|}\sum_{k=1}^KN_{tk}\log \frac{N_{tk}}{N_t} + a|T|\]

Note: the term in the formula above is \(N_{tk}\log\frac{N_{tk}}{N_t}\), not \(\frac{N_{tk}}{N_t}\log\frac{N_{tk}}{N_t}\).

Let: \[C_a(T) = C(T) + a|T|\]

\(C(T)\) represents the prediction error of the model on the training data, i.e., how well the model fits the training data, and \(|T|\) represents the complexity of the model. The parameter \(a \ge 0\) controls the trade-off between the two: a larger a favors a simpler model, a smaller a favors a more complex model, and a = 0 means only the fit to the training data is considered, regardless of model complexity.
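
As a concrete example of post-pruning with this kind of loss function, scikit-learn (version 0.22 and later, which is newer than the version whose output appears in section 4) exposes minimal cost-complexity pruning, where the ccp_alpha parameter plays the role of a in C_a(T) = C(T) + a|T| (using Gini impurity or misclassification cost rather than the entropy form above). A minimal sketch:

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=1000, noise=0.4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# grow a full tree, then compute the effective alpha values for pruning it back
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# refit with a non-zero ccp_alpha: the larger the alpha, the more aggressive the pruning
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2])
pruned_tree.fit(X_train, y_train)
print(pruned_tree.get_n_leaves(), pruned_tree.score(X_test, y_test))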

4. Training and fine-tuning a decision tree on the moons ('satellite') dataset with sklearn

4.1 Requirements

  • a. Use make_moons(n_samples=10000, noise=0.4) to generate a moons dataset
  • b. Use train_test_split() to split it into a training set and a test set
  • c. Use grid search with cross-validation to find good hyperparameters for DecisionTreeClassifier (hint: try various values of max_leaf_nodes)
  • d. Train on the full training set with those hyperparameters and measure the model's performance on the test set

Code

from sklearn.datasets import make_moons
import numpy as np
import pandas as pd
dataset = make_moons(n_samples=10000,noise=0.4)
print(type(dataset))
print(dataset)
<class 'tuple'>
(array([[ 0.24834453, -0.11160162],
       [-0.34658051, -0.43774172],
       [-0.25009951, -0.80638312],
       ...,
       [ 2.3278198 ,  0.39007769],
       [-0.77964208,  0.68470383],
       [ 0.14500963,  1.35272533]]), array([1, 1, 1, ..., 1, 0, 0], dtype=int64))
dataset_array = np.array(dataset[0])
label_array = np.array(dataset[1])
print(dataset_array.shape,label_array.shape)
(10000, 2) (10000,)
# split the dataset into training and test sets
# (splitting features and labels in a single call keeps them aligned)
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(
    dataset_array, label_array, test_size=0.2, random_state=42)
print(x_train.shape, x_test.shape)
print(y_train.shape, y_test.shape)
(8000, 2) (2000, 2)
(8000,) (2000,)
# use grid search with cross-validation to find good hyperparameters for DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV

decisionTree = DecisionTreeClassifier(criterion='gini')
param_grid = {'max_leaf_nodes': [i for i in range(2,10)]}
gridSearchCV = GridSearchCV(decisionTree,param_grid=param_grid,cv=3,verbose=2)
gridSearchCV.fit(x_train,y_train)
Fitting 3 folds for each of 8 candidates, totalling 24 fits
[CV] max_leaf_nodes=2 ................................................
[CV] ................................. max_leaf_nodes=2, total=   0.0s
[CV] max_leaf_nodes=2 ................................................
[CV] ................................. max_leaf_nodes=2, total=   0.0s
[CV] max_leaf_nodes=2 ................................................
[CV] ................................. max_leaf_nodes=2, total=   0.0s
[CV] max_leaf_nodes=3 ................................................
[CV] ................................. max_leaf_nodes=3, total=   0.0s
[CV] max_leaf_nodes=3 ................................................
[CV] ................................. max_leaf_nodes=3, total=   0.0s
[CV] max_leaf_nodes=3 ................................................
[CV] ................................. max_leaf_nodes=3, total=   0.0s
[CV] max_leaf_nodes=4 ................................................
[CV] ................................. max_leaf_nodes=4, total=   0.0s
[CV] max_leaf_nodes=4 ................................................
[CV] ................................. max_leaf_nodes=4, total=   0.0s
[CV] max_leaf_nodes=4 ................................................
[CV] ................................. max_leaf_nodes=4, total=   0.0s
[CV] max_leaf_nodes=5 ................................................
[CV] ................................. max_leaf_nodes=5, total=   0.0s
[CV] max_leaf_nodes=5 ................................................
[CV] ................................. max_leaf_nodes=5, total=   0.0s
[CV] max_leaf_nodes=5 ................................................
[CV] ................................. max_leaf_nodes=5, total=   0.0s
[CV] max_leaf_nodes=6 ................................................
[CV] ................................. max_leaf_nodes=6, total=   0.0s
[CV] max_leaf_nodes=6 ................................................
[CV] ................................. max_leaf_nodes=6, total=   0.0s
[CV] max_leaf_nodes=6 ................................................
[CV] ................................. max_leaf_nodes=6, total=   0.0s
[CV] max_leaf_nodes=7 ................................................
[CV] ................................. max_leaf_nodes=7, total=   0.0s
[CV] max_leaf_nodes=7 ................................................
[CV] ................................. max_leaf_nodes=7, total=   0.0s
[CV] max_leaf_nodes=7 ................................................
[CV] ................................. max_leaf_nodes=7, total=   0.0s
[CV] max_leaf_nodes=8 ................................................
[CV] ................................. max_leaf_nodes=8, total=   0.0s
[CV] max_leaf_nodes=8 ................................................
[CV] ................................. max_leaf_nodes=8, total=   0.0s
[CV] max_leaf_nodes=8 ................................................
[CV] ................................. max_leaf_nodes=8, total=   0.0s
[CV] max_leaf_nodes=9 ................................................
[CV] ................................. max_leaf_nodes=9, total=   0.0s
[CV] max_leaf_nodes=9 ................................................
[CV] ................................. max_leaf_nodes=9, total=   0.0s
[CV] max_leaf_nodes=9 ................................................
[CV] ................................. max_leaf_nodes=9, total=   0.0s


[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:    0.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done  24 out of  24 | elapsed:    0.0s finished

GridSearchCV(cv=3, error_score='raise-deprecating',
       estimator=DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best'),
       fit_params=None, iid='warn', n_jobs=None,
       param_grid={'max_leaf_nodes': [2, 3, 4, 5, 6, 7, 8, 9]},
       pre_dispatch='2*n_jobs', refit=True, return_train_score='warn',
       scoring=None, verbose=2)
print(gridSearchCV.best_params_)
decision_tree = gridSearchCV.best_estimator_
{'max_leaf_nodes': 4}
# evaluate the model on the test set
from sklearn.metrics import accuracy_score
y_pred = gridSearchCV.predict(x_test)
print('accuracy_score:', accuracy_score(y_test, y_pred))
accuracy_score: 0.8455
# visualize the model
from sklearn.tree import export_graphviz

export_graphviz(decision_tree,
                out_file='./tree.dot',
                rounded=True,
                filled=True)

This generates the file tree.dot, which can then be converted to an image with the dot command: dot -Tpng tree.dot -o decisontree_moons.png
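
If the Graphviz dot binary is not installed, an alternative (assuming scikit-learn >= 0.21 and matplotlib are available) is to draw the tree directly with sklearn.tree.plot_tree:

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(10, 6))
plot_tree(decision_tree, rounded=True, filled=True)   # same best_estimator_ as above
plt.savefig('decisiontree_moons.png')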

5. Appendix

5.1 sklearn.tree.DecisionTreeClassifier Class Description

5.1.1 DecisionTreeClassifier Parameter Description

  • criterion : the criterion for measuring split quality, string ('gini' or 'entropy'), default='gini'
  • splitter : the split strategy used at each node, string ('best' or 'random'), default='best'
  • max_depth : int, default=None
  • min_samples_split : int or float, default=2; the minimum number of samples required to split a node
  • min_samples_leaf:
  • min_weight_fraction_leaf:
  • max_features:
  • random_state:
  • max_leaf_nodes:
  • min_impurity_decrease:
  • min_impurity_split:
  • class_weight:
  • presort : bool, default=False; for small datasets (a few thousand samples or fewer), presort=True can speed up training by presorting the data, but on larger training sets it may slow training down

5.1.2 DecisionTreeClassifier Property Description

  • classes_:
  • feature_importances_:
  • max_features_:
  • n_classes_:
  • n_features_:
  • n_outputs_:
  • tree_:

5.2 GridSearchCV Class Description

5.2.1 GridSearchCV Parameter Description

  • estimator : the estimator to tune, inheriting from BaseEstimator
  • param_grid : dict; keys are parameter names and values are lists of parameter settings to try
  • scoring: default=None
  • fit_params:
  • n_jobs : the number of jobs to run in parallel; None means 1 job and -1 means all processors, default=None
  • cv : the cross-validation strategy; None or an integer, where None defaults to 3-fold and an integer specifies the number of folds of a (stratified) KFold
  • verbose : verbosity of the log output

5.2.2 GridSearchCV Property Description

  • cv_results_: dict of numpy(masked) ndarray
  • best_estimator_:
  • best_score_: Mean cross-validated score of the best_estimator
  • best_params_:
  • best_index_: int,The index (of the ``cv_results_`` arrays) which corresponds to the best candidate parameter setting
  • scorer_:
  • n_splits_: The number of cross-validation splits (folds/iterations)
  • refit_time_: float

References:

  • (1) <Hands-On Machine Learning with Scikit-Learn and TensorFlow>
  • (2) <One Hundred Faces of Machine Learning> (百面机器学习)
  • (3) Li Hang, <Statistical Learning Methods>

Origin www.cnblogs.com/xiaobingqianrui/p/11072556.html