Decision Trees: ID3, C4.5, CART, Random Forest, Bagging, Boosting, AdaBoost, GBDT, XGBoost

Decision Tree

1. A decision tree learning algorithm consists of three parts:

  • Feature selection
  • Decision tree generation
  • Tree pruning

Feature Selection

On what criterion should we judge the classification ability of a feature?

This is where the concept of information gain comes in. It is built on entropy, whose formula is (x_i denotes an event and P its probability):

                                                       H(X) = -\sum_{i=1}^{N} P(x_{i}) \log P(x_{i})

As an example of entropy: suppose the way I go to school tomorrow is biking with probability 1/2 and walking with probability 1/2, so P(x_1) = 1/2 and P(x_2) = 1/2; plugging these into the formula gives H = -(1/2 log_2 1/2 + 1/2 log_2 1/2) = 1 bit.
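A minimal sketch of this computation (my own toy code, plain Python, nothing library-specific):

from math import log2

def entropy(probs):
    """Shannon entropy H(X) = -sum(p * log2(p)) over a list of probabilities."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0 bit: the bike-or-walk example
print(entropy([2/5, 3/5]))    # about 0.971: the outlook = sunny case below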

Here is an example

https://www.bilibili.com/video/av29638322/?p=3

 

 

When outlook = sunny, the probability of playing is 2/5 and of not playing is 3/5; plugging into the formula gives -\frac{2}{5}\log_{2}\frac{2}{5}-\frac{3}{5}\log_{2}\frac{3}{5} \approx 0.971.

 

(1) ID3: selects the optimal splitting attribute by the information gain criterion.

Suppose each record has an "ID" attribute. If we split on ID, then because every ID is unique, the split produces as many branches as there are samples, i.e., the attribute takes as many distinct values as there are records. Every resulting leaf is perfectly pure, so the information gain is very large, yet a tree split this way is meaningless. In other words, ID3 has a built-in preference for attributes with many values. To reduce this bias, C4.5 was proposed.

(2) C4.5: selects the optimal splitting attribute by the information gain ratio criterion.

The gain ratio introduces a term called split information to penalize attributes with many values. In effect, the information gain is divided by the attribute's own entropy (an attribute that splits the data into many parts has a large entropy itself, which suppresses its gain ratio):

                                                Gain\_ratio(D, A) = \frac{Gain(D, A)}{IV(A)}, \qquad IV(A) = -\sum_{v=1}^{V}\frac{|D^{v}|}{|D|}\log_{2}\frac{|D^{v}|}{|D|}

In the formula above, the numerator is the same information gain used by ID3, and the denominator IV(A) is determined by attribute A: the more values A takes, the larger IV(A) and the smaller the gain ratio, which avoids the preference for attributes with many values. But a sharp reader will notice that if we follow this rule blindly, the model will instead prefer attributes with few values. So C4.5 uses a heuristic: first pick the candidate attributes whose information gain is above average, and then, among those, choose the one with the highest gain ratio.
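To make the criterion concrete, here is a small illustrative sketch (my own toy code, not part of any library) that computes the information gain, the split information IV(A), and their ratio for one candidate attribute:

from collections import Counter
from math import log2

def entropy_of(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, attr_values):
    """C4.5 gain ratio of splitting `labels` by the values in `attr_values`."""
    n = len(labels)
    base = entropy_of(labels)
    cond, iv = 0.0, 0.0
    for v in set(attr_values):
        subset = [y for y, a in zip(labels, attr_values) if a == v]
        w = len(subset) / n
        cond += w * entropy_of(subset)   # conditional entropy H(D|A)
        iv -= w * log2(w)                # split information IV(A)
    gain = base - cond
    return gain / iv if iv > 0 else 0.0

labels = ['no', 'no', 'yes', 'yes', 'yes']
print(gain_ratio(labels, ['sunny', 'sunny', 'rain', 'rain', 'overcast']))  # ordinary attribute
print(gain_ratio(labels, [1, 2, 3, 4, 5]))   # ID-like attribute: huge gain, but IV damps the ratio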

For a continuous attribute the number of possible values is unbounded, so a discretization technique (e.g., bi-partition) is used: sort the attribute values in ascending order and take the midpoint between adjacent values as a candidate split point; samples smaller than the split point go to the left subtree and samples not smaller go to the right subtree. Compute the information gain (ratio) for each candidate split and choose the split value with the largest gain.
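A sketch of that bi-partition search, assuming plain information gain as the criterion (toy code of mine, self-contained):

from collections import Counter
from math import log2

def entropy_of(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Best binary split point of a continuous attribute by information gain."""
    base = entropy_of(labels)
    pairs = sorted(zip(values, labels))
    best_t, best_gain = None, -1.0
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        t = (v1 + v2) / 2                # midpoint between adjacent sorted values
        left = [y for v, y in pairs if v < t]
        right = [y for v, y in pairs if v >= t]
        cond = len(left) / len(pairs) * entropy_of(left) \
             + len(right) / len(pairs) * entropy_of(right)
        if base - cond > best_gain:
            best_t, best_gain = t, base - cond
    return best_t, best_gain

print(best_threshold([10, 50, 30, 60], ['no', 'yes', 'no', 'yes']))   # (40.0, 1.0)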

(3) CART: uses the Gini index as the criterion for selecting the optimal splitting attribute; it can be used for both classification and regression.

Its role is much the same as that of entropy.

CART is a binary tree built by binary splitting: each split cuts the data into two parts, which go to the left and right subtrees. Every non-leaf node therefore has exactly two children, so CART has one more leaf node than non-leaf nodes. Compared with ID3 and C4.5, CART has wider applications: it can be used for both classification and regression. For classification, CART uses the Gini index to select the best splitting feature; like entropy, the Gini index describes purity. Each CART iteration reduces the Gini index.

                                  Gini(D) = 1 - \sum_{k=1}^{K} p_{k}^{2}, \qquad Gini\_index(D, A) = \sum_{i=1}^{n}\frac{|D_{i}|}{|D|}\,Gini(D_{i})

Here D_i is the i-th of the n subsets obtained by splitting D on attribute A.

Gini(D) reflects the purity of dataset D: the smaller the value, the higher the purity. Among the candidate attributes, we choose the one whose split yields the minimum Gini index as the optimal splitting attribute.
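A minimal sketch of these two formulas (toy code of mine, for illustration only):

from collections import Counter

def gini(labels):
    """Gini(D) = 1 - sum(p_k^2)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_index(labels, attr_values):
    """Weighted Gini index of splitting `labels` by `attr_values`; smaller is purer."""
    n = len(labels)
    total = 0.0
    for v in set(attr_values):
        subset = [y for y, a in zip(labels, attr_values) if a == v]
        total += len(subset) / n * gini(subset)
    return total

print(gini(['no', 'no', 'yes', 'yes', 'yes']))   # 1 - (0.4^2 + 0.6^2) = 0.48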

Tree pruning

If a decision tree is grown until it fits the training set completely, the model adapts too closely to the training data and fits much of the noise, i.e., it overfits. To avoid this, the tree is pruned.

 

 

Specific calculation method

To compute the information gain, first compute the empirical entropy of the data, which is just the entropy formula above: to obtain each probability, count the number of samples in each class and divide by the total number of samples. The key point is then the conditional entropy. For example, if there is an "age" feature whose values are {10, 50, 30, 60}, how exactly should it be divided into subsets? That is decided by how you yourself define the grouping.

Specific examples

https://blog.csdn.net/jiaoyangwm/article/details/79525237

The task is to decide whether a loan can be granted.

from math import log

"""
Function: create the test dataset
Parameters: none
Returns:
    dataSet: the dataset
    labels: the feature (attribute) names
Modify:
    2018-03-12

"""
def creatDataSet():
    # the dataset
    dataSet=[[0, 0, 0, 0, 'no'],
            [0, 0, 0, 1, 'no'],
            [0, 1, 0, 1, 'yes'],
            [0, 1, 1, 0, 'yes'],
            [0, 0, 0, 0, 'no'],
            [1, 0, 0, 0, 'no'],
            [1, 0, 0, 1, 'no'],
            [1, 1, 1, 1, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [1, 0, 1, 2, 'yes'],
            [2, 0, 1, 2, 'yes'],
            [2, 0, 1, 1, 'yes'],
            [2, 1, 0, 1, 'yes'],
            [2, 1, 0, 2, 'yes'],
            [2, 0, 0, 0, 'no']]
    #feature (attribute) names
    labels=['age','has job','owns house','credit rating']
    #return the dataset and the feature names
    return dataSet,labels


"""
Function: compute the empirical entropy (Shannon entropy) of a given dataset
Parameters:
    dataSet: the dataset
Returns:
    shannonEnt: the empirical entropy
Modify:
    2018-03-12

"""
def calcShannonEnt(dataSet):
    #number of rows in the dataset
    numEntries=len(dataSet)
    #dictionary that counts how many times each label appears
    labelCounts={}
    #count the label of every feature vector
    for featVec in dataSet:
        currentLabel=featVec[-1]                     #extract the label
        if currentLabel not in labelCounts.keys():   #if the label is not in the dictionary yet, add it
            labelCounts[currentLabel]=0
        labelCounts[currentLabel]+=1                 #increment the label count

    shannonEnt=0.0                                   #empirical entropy
    #compute the empirical entropy
    for key in labelCounts:
        prob=float(labelCounts[key])/numEntries      #probability of this label
        shannonEnt-=prob*log(prob,2)                 #apply the entropy formula
    return shannonEnt                                #return the empirical entropy


"""
Function: choose the feature with the largest information gain
Parameters:
    dataSet: the dataset
Returns:
    bestFeature: index of the feature with the largest information gain
Modify:
    2018-03-12

"""


def chooseBestFeatureToSplit(dataSet):
    #number of features
    numFeatures = len(dataSet[0]) - 1
    #Shannon entropy of the whole dataset
    baseEntropy = calcShannonEnt(dataSet)
    #best information gain found so far
    bestInfoGain = 0.0
    #index of the best feature
    bestFeature = -1
    #iterate over all features
    for i in range(numFeatures):
        # collect the i-th feature value of every sample
        featList = [example[i] for example in dataSet]
        #build a set so that values are unique
        uniqueVals = set(featList)
        #empirical conditional entropy
        newEntropy = 0.0
        #compute the information gain
        for value in uniqueVals:
            #subDataSet is the subset obtained by the split
            subDataSet = splitDataSet(dataSet, i, value)
            #proportion of samples that fall into this subset
            prob = len(subDataSet) / float(len(dataSet))
            #accumulate the empirical conditional entropy
            newEntropy += prob * calcShannonEnt(subDataSet)
        #information gain
        infoGain = baseEntropy - newEntropy
        #print the information gain of each feature
        print("Information gain of feature %d: %.3f" % (i, infoGain))
        #keep track of the largest information gain
        if (infoGain > bestInfoGain):
            #update the largest information gain
            bestInfoGain = infoGain
            #record the index of the feature with the largest gain
            bestFeature = i
    #return the index of the feature with the largest information gain
    return bestFeature

"""
Function: split the dataset on a given feature
Parameters:
    dataSet: the dataset to be split
    axis: index of the feature to split on
    value: the feature value to keep
Returns:
    retDataSet: the subset of samples whose feature equals value, with that feature removed
Modify:
    2018-03-12

"""
def splitDataSet(dataSet,axis,value):
    retDataSet=[]
    for featVec in dataSet:
        if featVec[axis]==value:
            reducedFeatVec=featVec[:axis]
            reducedFeatVec.extend(featVec[axis+1:])
            retDataSet.append(reducedFeatVec)
    return retDataSet


#main function
if __name__=='__main__':
    dataSet,features=creatDataSet()
    # print(dataSet)
    # print(calcShannonEnt(dataSet))
    print("Index of the best feature: "+str(chooseBestFeatureToSplit(dataSet)))

Result:

Information gain of feature 0: 0.083
Information gain of feature 1: 0.324
Information gain of feature 2: 0.420
Information gain of feature 3: 0.363
Index of the best feature: 2

2. Random Forest

The "random" in Random Forest shows up in two places: random selection of samples and random selection of features.

Bagging is a building block of Random Forest. Bagging samples with replacement: draw a sample into the training set, then put it back into the original pool so it may be drawn again. In this way T training sets of m samples each can be built, and each training set is different. Random Forest goes further than Bagging by also introducing randomness into the training of each decision tree: it randomly selects a subset of the features, for example 80% of all features, and builds the tree on that subset. (In practice each feature is selected with some probability, e.g., selected with probability 0.8 and left out with probability 0.2.)

3. Bagging

Bagging is short for bootstrap aggregation. By drawing bootstrap samples (sampling with replacement) we construct n training sets, train a weak classifier on each of them separately, and then combine the n weak classifiers' outputs with some combination strategy (e.g., voting or averaging) to obtain the strong classifier we need.

Bagging is widely used; for example, a random forest applies bagging to n decision trees and then selects the most likely outcome by voting.
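For reference, both ideas are one line each in scikit-learn. The sketch below (parameter values are arbitrary choices for illustration) compares plain bagging of decision trees with a random forest, which additionally subsamples features:

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Plain bagging: T bootstrap samples, one tree per sample, majority vote.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50, random_state=0)

# Random forest: bagging plus random feature selection (here roughly 80% of the features per split).
rf = RandomForestClassifier(n_estimators=50, max_features=0.8, random_state=0)

print(cross_val_score(bag, X, y, cv=5).mean())
print(cross_val_score(rf, X, y, cv=5).mean())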

 

Machine learning: overfitting and underfitting

Overfitting:

Causes of overfitting:

(1) The model is too complex, e.g., the network is too deep.

(2) Too many variables (features).

(3) Too little training data.

Solutions:

(1) Reduce the number of features (feature selection).

(2) Early stopping.

(3) Enlarge the dataset.

"Sometimes it is not that the better algorithm wins, but simply that whoever has more data wins."

(4) Dropout.

During training we randomly "delete" 20%–50% of the hidden units and treat them as absent; this is repeated over the iterations until training ends, dropping 20%–50% of the hidden units each time.

(5) Regularization, including L1 and L2.

Regularization keeps all the feature variables but shrinks their magnitudes. It works through a penalty term: the penalty pushes some parameter values toward smaller values. The smaller the parameters, the smoother (i.e., simpler) the corresponding function, and the less prone it is to overfitting.

(6) Clean the data.

Underfitting:

Causes of underfitting:

The model is not complex enough to capture the underlying relationships in the data, leading to a poor representation of the data.

Solutions:

1) Add more feature terms.

2) Add polynomial features.

For example, adding quadratic or cubic terms to a linear model makes the model more expressive.

3) Reduce the regularization parameter.

The purpose of regularization is to prevent overfitting, but when the model underfits, the regularization parameter should be reduced.

 

Why does bagging reduce variance while boosting reduces bias?

https://www.zhihu.com/question/26760839

 

 

 

 

 

2. Boosting

Boosting is an iterative process that adaptively changes the distribution of the training samples so that the weak classifiers focus on the samples that are hard to classify. It does this by assigning each training sample a weight and automatically adjusting the weights at the end of each round of training.

Representative boosting algorithms include AdaBoost, GBDT, and XGBoost.

(Figure: the general boosting procedure.)

 

 

The most common method is AdaBoost [1], short for adaptive boosting. To keep this article readable, I wrote a separate article on AdaBoost, which you can refer to for the details:

A step-by-step, plain-language walkthrough of the AdaBoost algorithm and its formulas, with worked examples: zhuanlan.zhihu.com

Here is an analogy. When doing homework, have you ever kept a notebook of the problems you got wrong? If I get a problem wrong on today's exam, I write it into my mistake notebook; before the next exam, I redo only the problems in that notebook. If I get one right next time, I delete it from the notebook; otherwise it stays and I do it again. Round after round, my exam score is likely to improve. Boosting works on the same principle.

AdaBoost (Adaptive Boosting) training process

  1. At the start, every training sample is assigned an equal weight. The algorithm then trains for T rounds; after each round, the samples that were misclassified are given larger weights, i.e., the learner pays more attention to the samples it got wrong in the previous round.
  2. A new base learner is trained on the re-weighted samples, and this repeats until the number of base learners reaches the preset T.
  3. The T base learners are combined by a weighted linear combination.

AdaBoost uses the exponential loss function.
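As a usage sketch of the procedure above, scikit-learn's AdaBoostClassifier implements this reweight-and-combine loop; here the base learners are depth-1 trees (stumps), and the parameter values are arbitrary choices for illustration:

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# T = n_estimators weak learners, combined by a weighted vote.
ada = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                         n_estimators=100, learning_rate=0.5, random_state=0)
ada.fit(X_train, y_train)
print(ada.score(X_test, y_test))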

GBDT(Gradient Boosting Decision Tree)

  1. GBDT uses regression trees, not classification trees.
  2. The core of GBDT is that the final result is the sum of the outputs of all the trees.
  3. The key point of GBDT is to use the negative gradient of the loss function to approximate (replace) the residual; for a general loss function, only the first derivative is needed.

GBDT is gradient boosting with decision trees (CART) as the base learners, and the trees are regression trees rather than classification trees. "Boost" means to improve: boosting is an iterative procedure in which each new round of training improves on the previous result. With the AdaBoost groundwork above, the general idea should be easy to grasp.


The core of GBDT is that each tree learns the residual of the sum of all previous trees, i.e., the amount that must be added to the accumulated prediction to reach the true value. For example, suppose A's true age is 18 but the first tree predicts 12; the error is 6, so the residual is 6. In the second tree we set A's age to 6 and learn that. If the second tree can really assign A to a leaf worth 6, then the two trees' outputs sum to A's true age. If instead the second tree predicts 5, a residual of 1 remains; in the third tree A's age becomes 1, and learning continues.

example

The residual here is the actual value minus the predicted value; in practice the negative gradient of the loss function is used to approximate the residual.
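A from-scratch toy sketch of this idea for squared-error regression (my own illustration, not the post's code): each new tree fits the current residual, which for squared loss is exactly the negative gradient, and a shrinkage factor scales each tree's contribution.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbdt_fit(X, y, n_trees=100, lr=0.1, max_depth=2):
    """Toy gradient boosting for squared loss: each tree fits the residual y - F(x)."""
    y = np.asarray(y, dtype=float)
    init = float(np.mean(y))
    pred = np.full(len(y), init)            # F_0: constant prediction
    trees = []
    for _ in range(n_trees):
        residual = y - pred                 # negative gradient of 0.5 * (y - F)^2
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residual)
        pred = pred + lr * tree.predict(X)  # shrinkage step
        trees.append(tree)
    return init, trees

def gbdt_predict(X, init, trees, lr=0.1):
    return init + lr * sum(t.predict(X) for t in trees)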

 

 

 

 

How is the prediction made?

GBDT

Above, the residual is the quantity being fitted; GBDT computes the negative gradient of the loss function at the current model's prediction and uses it as an estimate of the residual.

https://zhuanlan.zhihu.com/p/59434537

https://www.jianshu.com/p/005a4e6ac775

https://zhuanlan.zhihu.com/p/29765582

https://www.bilibili.com/video/av64320212/?p=35

scikit-learn GBDT parameter settings

sklearn.ensemble.GradientBoostingClassifier(loss='deviance', learning_rate=0.1, n_estimators=100, subsample=1.0, criterion='friedman_mse', min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_depth=3, min_impurity_decrease=0.0, min_impurity_split=None, init=None, random_state=None, max_features=None, verbose=0, max_leaf_nodes=None, warm_start=False, presort='auto')

 

n_estimators: the maximum number of boosting iterations, i.e., the maximum number of weak learners (the number of trees to grow).

max_depth: the maximum depth of each decision tree.

loss: the loss function used by the GBDT algorithm. For regression the available losses include squared error 'ls', absolute loss 'lad', Huber loss 'huber', and quantile loss 'quantile'.
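A short usage example (hedged: newer scikit-learn releases have renamed some of these loss options, so check the version you are running; the other parameter values here are arbitrary):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

gbdt = GradientBoostingClassifier(n_estimators=100,   # number of trees / boosting rounds
                                  learning_rate=0.1,  # shrinkage
                                  max_depth=3,        # depth of each regression tree
                                  subsample=0.8,      # row subsampling per tree
                                  random_state=0)
gbdt.fit(X_train, y_train)
print(gbdt.score(X_test, y_test))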

 

 

(3) XGBoost

Compared with GBDT, XGBoost applies numerical optimization more effectively; most importantly, the loss function (the error between the prediction and the true value) becomes more elaborate. The objective is still that the sum of the outputs of all the trees equals the predicted value.

The loss function below brings in both the first and the second derivatives:

A good model needs two basic qualities: first, good accuracy (a good fit); second, the model should be as simple as possible (complex models are prone to overfitting and are less stable). Therefore, in the objective function we build, the first term on the right is the error term and the second term is the regularization term (the penalty on model complexity).

Common error terms are the squared error and the logistic loss; common penalty terms are L1 and L2 regularization, where L1 sums the absolute values of the model's parameters and L2 sums their squares.

At each iteration, a new tree is added on top of the existing trees to fit the residual between the previous trees' prediction and the true value.

In the objective function of the figure above, the circled part of the last line is in fact the residual between the predicted value and the true value.

First, expand the training error term:

XGBoost performs a second-order Taylor expansion of the cost function and uses both the first-order and second-order derivatives.
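In the usual notation (g_i and h_i are the first and second derivatives of the loss with respect to the current prediction), the expanded objective at round t is:

                        Obj^{(t)} \simeq \sum_{i=1}^{n}\left[ l\left(y_{i}, \hat{y}_{i}^{(t-1)}\right) + g_{i} f_{t}(x_{i}) + \frac{1}{2} h_{i} f_{t}^{2}(x_{i}) \right] + \Omega(f_{t})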

Next, look at the regularization term of the objective function:

The complexity of a tree can be measured by how many branches it has, and the number of branches can be represented by the number of leaf nodes.

So in the expression for tree complexity, the first term on the right penalizes the number of leaf nodes T, and the second term is the L2 regularization of the leaf weights w; the regularization keeps the leaf scores from growing too large and prevents an excessive number of leaves.
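In standard notation, with T leaf nodes and leaf weights w, the complexity term is:

                        \Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_{j}^{2}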

At each iteration, then, adding a tree to the current model means adding a term to the objective function. A tree is written as f(x) = w_{q(x)}, which encodes both the tree structure and the leaf weights: w is the vector of leaf weights (the predicted scores), and q is the structure function that maps a sample to the index of the leaf it falls into.

Differentiating the objective with respect to the leaf weights w, solving for the optimal w, and substituting it back into the objective, we find that the objective value is determined by the part marked in red:

Thus each XGBoost iteration selects the optimal split point using a gain defined by the following formula:
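In the usual notation (G_L, H_L and G_R, H_R are the sums of the first- and second-order gradients of the samples falling into the left and right child), the split gain is:

                        Gain = \frac{1}{2}\left[ \frac{G_{L}^{2}}{H_{L}+\lambda} + \frac{G_{R}^{2}}{H_{R}+\lambda} - \frac{(G_{L}+G_{R})^{2}}{H_{L}+H_{R}+\lambda} \right] - \gamma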

So how do we obtain a good tree structure?

One approach is a greedy algorithm: within a node, enumerate all features, compute the gain by the formula above for every possible split of each feature, and split the tree at the point with the largest gain. The penalty term for adding a new leaf acts as pruning: when the gain is below a threshold, the split can be abandoned. This exact approach does not scale to large amounts of data, however, so an approximate algorithm is needed.

Another method: when looking for split points, XGBoost does not enumerate all feature values. Instead it aggregates statistics over the feature values and, according to the distribution (density) of the feature values, builds a histogram and partitions the distribution into several buckets of equal weight; the feature values on the bucket boundaries become the candidate split points, and the best split point is chosen from among these candidates.

Explanation of the approximate algorithm's formula above: sort the values of feature k and compute their distribution. r_k(z) denotes, for feature k, the proportion of the total weight carried by samples whose value of feature k is less than z; it reflects the importance of those feature values. Based on this proportion the feature values are divided into buckets of equal weight, and the values on the bucket boundaries are taken as the candidate split points; the condition on the candidate set is that the difference in this proportion between two adjacent candidate split points is less than a certain threshold.

 

Based on the explanation above, the innovations of XGBoost compared with GBDT are:

Traditional GBDT uses CART as the base learner; XGBoost also supports linear base learners, in which case XGBoost is equivalent to logistic regression (for classification) or linear regression (for regression) with L1 and L2 regularization.

  • Traditional GBDT uses only first-order derivative information in its optimization, while XGBoost performs a second-order Taylor expansion of the cost function and uses both the first and the second derivatives. Incidentally, the XGBoost tool supports custom cost functions, as long as the function has first and second derivatives.
  • XGBoost adds a regularization term to the cost function to control model complexity. The regularization term includes the number of leaf nodes in the tree and the L2 norm of the score output on each leaf node. From a bias-variance tradeoff point of view, the regularization term reduces the variance of the model, makes the learned model simpler, and prevents overfitting; this is one way XGBoost improves on traditional GBDT.
  • Shrinkage, equivalent to the learning rate (eta in XGBoost). At each iteration the newly added model is multiplied by a coefficient smaller than 1 to slow down the optimization; approaching the optimum with many small steps is less likely to overfit than approaching it with a few big steps.
  • Column subsampling. XGBoost borrows from random forests and supports column (feature) subsampling (i.e., each tree uses only a subset of the features rather than all of them), which both reduces overfitting and reduces computation; this is another way XGBoost differs from traditional GBDT.
  • Ignoring missing values: when searching for a split point, samples whose value of the feature is missing are not traversed when accumulating the statistics; only the non-missing values in the feature's value list are traversed. This engineering trick reduces the time spent searching for split points on sparse discrete features.
  • A default direction for missing values: a default branch direction can be specified for missing values. To be complete, the samples with missing values for the split feature are tried in both the left and the right child, and whichever direction brings the larger gain becomes the default child; this greatly improves the efficiency of the algorithm.
  • Parallel processing: before training, every feature is pre-sorted to find candidate split points and the result is stored in a block structure that is reused in later iterations, which greatly reduces the amount of computation. When splitting a node, the gain of every feature must be computed and the feature with the largest gain is chosen for the split; the gain computation of different features can be spread over multiple threads, i.e., the best split point is searched for in parallel across feature attributes.
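Finally, a short usage sketch of the xgboost library that touches several of the points above: shrinkage via learning_rate (eta), column subsampling via colsample_bytree, the L2 leaf penalty via reg_lambda, and the split-gain threshold via gamma. The parameter values are arbitrary, and exact defaults may differ between xgboost versions.

import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = xgb.XGBClassifier(n_estimators=200,      # boosting rounds
                          max_depth=4,           # depth of each tree
                          learning_rate=0.1,     # shrinkage (eta)
                          subsample=0.8,         # row subsampling
                          colsample_bytree=0.8,  # column subsampling
                          reg_lambda=1.0,        # L2 penalty on leaf weights
                          gamma=0.0)             # minimum gain required to make a split
model.fit(X_train, y_train)
print(model.score(X_test, y_test))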

 

 

 


Source: blog.csdn.net/pursuit_zhangyu/article/details/102555363