[Technology Sharing] Decision Tree Classification

Author of this article: Di Yin. Republished with the author's authorization.

Original link: https://cloud.tencent.com/developer/article/1550440

1 Decision tree theory

1.1 What is a decision tree

The so-called decision tree, as the name implies, is a tree built from a sequence of choices. In machine learning, a decision tree is a predictive model: it represents a mapping between object attributes and object values. Each node in the tree represents an object, each branch represents a possible attribute value, and the path from the root node to a leaf node corresponds to a sequence of decision tests. A decision tree has a single output; if a more complex output is needed, an independent decision tree can be built for each output.

1.2 Decision tree learning process

The main goal of decision tree learning is to produce a decision tree with strong generalization ability. The basic process follows a simple and intuitive "divide and conquer" strategy, and can be described as follows:

Input:  training set D = {(x_1,y_1),(x_2,y_2),...,(x_m,y_m)};
        attribute set A = {a_1,a_2,...,a_d}
Process: function GenerateTree(D, A)
1: create node node;
2: if all samples in D belong to the same class C then
3:    mark node as a leaf node of class C, and return
4: end if
5: if A is empty OR all samples in D take the same values on A then
6:    mark node as a leaf node whose class is the majority class in D, and return
7: end if
8: select the optimal splitting attribute a* from A;    // each attribute has several values; assume a* has v values
9: for each value a*_v of a* do
10:    generate a branch for node; let D_v denote the subset of samples in D that take value a*_v on a*;
11:    if D_v is empty then
12:       mark the branch node as a leaf node whose class is the majority class in D, and return
13:    else
14:       use GenerateTree(D_v, A \ {a*}) as the branch node
15:    end if
16: end for

The generation of a decision tree is a recursive process. Three situations cause the recursion to return: (1) all samples contained in the current node belong to the same class; (2) the current attribute set is empty, or all samples take the same values on all attributes; (3) the sample set contained in the current node is empty.

In case (2), we mark the current node as a leaf node and set its class to the majority class among the samples it contains; in case (3), we also mark the current node as a leaf node, but set its class to the majority class of the samples in its parent node. The two cases are essentially different: the former uses the posterior distribution of the current node, while the latter uses the sample distribution of the parent node as the prior distribution of the current node.
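
To make the pseudocode concrete, the following is a minimal Scala sketch of the same divide-and-conquer recursion. Sample, TreeNode, Leaf, Internal, majorityLabel, chooseBestAttribute and the domain map are names made up for this illustration, not MLlib APIs, and a real implementation would choose the splitting attribute by information gain rather than the placeholder used here.

// Illustrative sketch of GenerateTree(D, A); not MLlib code.
case class Sample(attributes: Map[String, String], label: String)

sealed trait TreeNode
case class Leaf(label: String) extends TreeNode
case class Internal(attribute: String, children: Map[String, TreeNode]) extends TreeNode

def majorityLabel(d: Seq[Sample]): String =
  d.groupBy(_.label).maxBy(_._2.size)._1

// Placeholder: a real implementation would pick the attribute with the highest information gain.
def chooseBestAttribute(d: Seq[Sample], a: Set[String]): String = a.head

def generateTree(d: Seq[Sample], a: Set[String], domain: Map[String, Set[String]]): TreeNode = {
  if (d.map(_.label).distinct.size == 1)
    Leaf(d.head.label)                                  // case (1): all samples share one class
  else if (a.isEmpty || d.map(s => a.toSeq.sorted.map(s.attributes)).distinct.size == 1)
    Leaf(majorityLabel(d))                              // case (2): majority class of the current node
  else {
    val best = chooseBestAttribute(d, a)
    val children = domain(best).map { v =>
      val dv = d.filter(_.attributes(best) == v)
      v -> (if (dv.isEmpty) Leaf(majorityLabel(d))      // case (3): majority class of the parent's samples
            else generateTree(dv, a - best, domain))
    }.toMap
    Internal(best, children)
  }
}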

1.3 Construction of decision tree

The key step in constructing a decision tree is splitting on attributes (that is, branching on the different values of an attribute, corresponding to a*_v in the process above). Splitting on an attribute means constructing different branches at a node according to a partition of that attribute's values, with the goal of making each resulting subset as "pure" as possible, i.e., making the items in each subset belong to the same category as far as possible. Attribute splits fall into three situations:

  • 1. The attribute is discrete and a binary decision tree is not required. In this case each value of the attribute becomes a branch.
  • 2. The attribute is discrete and a binary decision tree is required. In this case a subset of the attribute values is used as the test, and two branches are generated according to "belongs to this subset" and "does not belong to this subset".
  • 3. The attribute is continuous. In this case a value is chosen as the split point split_point, and two branches are generated according to > split_point and <= split_point (see the sketch below).
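
As a rough illustration of case 3, the sketch below generates the two branches of a continuous split at a threshold; Obs, splitContinuous and the concrete numbers are made up for the example.

// Illustrative only: the two branches of a continuous split at split_point.
case class Obs(feature: Double, label: Int)

def splitContinuous(samples: Seq[Obs], splitPoint: Double): (Seq[Obs], Seq[Obs]) =
  samples.partition(_.feature <= splitPoint)   // left: <= split_point, right: > split_point

val (left, right) = splitContinuous(Seq(Obs(1.2, 0), Obs(2.8, 1), Obs(3.4, 1)), splitPoint = 2.0)
// left  -> Obs(1.2, 0)
// right -> Obs(2.8, 1), Obs(3.4, 1)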

1.4 Split selection

In the decision tree algorithm, choosing the optimal partition attribute is the most critical step. In general, as the partitioning proceeds, we want the samples contained in each branch node to belong to the same class as much as possible, that is, the "purity" of the nodes to become higher and higher. Several indicators measure the purity of a sample set. In MLlib, information entropy and the Gini index are used for decision tree classification, and variance is used for decision tree regression.

1.4.1 Information entropy

Information entropy is the most commonly used indicator for measuring the purity of a sample set. Assuming the proportion of class-k samples in the current sample set D is p_k (k = 1, 2, ..., |Y|), the information entropy of D is defined as:

Ent(D) = - Σ_{k=1}^{|Y|} p_k * log2(p_k)

The smaller the value of Ent(D), the higher the purity of D.
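
For reference, a minimal Scala sketch of this definition (illustrative helper code, not MLlib internals; log base 2 is obtained via the change-of-base formula):

// Information entropy of a set of class labels: Ent(D) = - sum_k p_k * log2(p_k)
def entropy(labels: Seq[Int]): Double = {
  val n = labels.size.toDouble
  labels.groupBy(identity).values
    .map(_.size / n)                              // class proportions p_k
    .map(p => -p * (math.log(p) / math.log(2)))   // - p_k * log2(p_k)
    .sum
}

entropy(Seq(1, 1, 1, 1))   // 0.0 -> completely pure
entropy(Seq(0, 0, 1, 1))   // 1.0 -> maximally impure for two classes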

1.4.2 Gini index

Using the same notation as above, the Gini index can be used to measure the purity of the data set D:

Gini(D) = Σ_{k=1}^{|Y|} Σ_{k'≠k} p_k * p_{k'} = 1 - Σ_{k=1}^{|Y|} p_k^2

Intuitively, Gini(D) reflects the probability that two samples drawn at random from data set D have inconsistent class labels. Therefore, the smaller Gini(D) is, the higher the purity of data set D.
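
The same kind of sketch for the Gini index (again illustrative, not MLlib code):

// Gini index of a set of class labels: Gini(D) = 1 - sum_k p_k^2
def gini(labels: Seq[Int]): Double = {
  val n = labels.size.toDouble
  1.0 - labels.groupBy(identity).values.map(g => math.pow(g.size / n, 2)).sum
}

gini(Seq(1, 1, 1, 1))   // 0.0 -> pure
gini(Seq(0, 0, 1, 1))   // 0.5 -> two equally likely classes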

1.4.3 Variance

MLlib uses variance to measure purity for decision tree regression, defined as follows:

Variance(D) = (1/N) * Σ_{i=1}^{N} (y_i - μ)^2, where N is the number of samples in D and μ is the mean of their labels y_i.
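
A corresponding sketch for the regression case (illustrative only):

// Variance impurity of the labels in a regression node: (1/N) * sum_i (y_i - mean)^2
def variance(labels: Seq[Double]): Double = {
  val mean = labels.sum / labels.size
  labels.map(y => math.pow(y - mean, 2)).sum / labels.size
}

variance(Seq(2.0, 2.0, 2.0))   // 0.0 -> all labels identical
variance(Seq(1.0, 3.0))        // 1.0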

1.4.4 Information gain

Assume a data set D of size N is split into two data sets D_left and D_right, of sizes N_left and N_right. Then the information gain can be expressed as follows:

Gain = Impurity(D) - (N_left / N) * Impurity(D_left) - (N_right / N) * Impurity(D_right)

where Impurity is one of the three measures above (information entropy, Gini index, or variance).

In general, the greater the information gain, the greater the purity improvement obtained by splitting on attribute a. Therefore, information gain can be used to select the splitting attribute of the decision tree; this is step 8 of the process above.
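
Putting the pieces together, the following sketch computes the gain as a weighted impurity decrease; the impurity parameter can be any of the functions sketched above (illustrative code, not MLlib's internal implementation):

// Information gain of a binary split:
// gain = Impurity(D) - N_left/N * Impurity(D_left) - N_right/N * Impurity(D_right)
def gain[T](d: Seq[T], dLeft: Seq[T], dRight: Seq[T], impurity: Seq[T] => Double): Double = {
  val n = d.size.toDouble
  impurity(d) - (dLeft.size / n) * impurity(dLeft) - (dRight.size / n) * impurity(dRight)
}

// Example with a Gini impurity on the labels (0, 0, 1, 1):
def gini(labels: Seq[Int]): Double = {
  val n = labels.size.toDouble
  1.0 - labels.groupBy(identity).values.map(g => math.pow(g.size / n, 2)).sum
}
val d = Seq(0, 0, 1, 1)
gain(d, d.filter(_ == 0), d.filter(_ == 1), gini)   // 0.5: a perfect split removes all impurity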

1.5 Advantages and disadvantages of decision trees

Advantages of decision trees:

  • 1 Decision trees are easy to understand and interpret;
  • 2 They can handle both numerical and categorical attributes;
  • 3 A decision tree is a white-box model: given an observation, the corresponding logical expression is easy to read off from the tree;
  • 4 They can produce good results on large-scale data in a relatively short time;
  • 5 They are comparatively well suited to samples with missing attribute values.

Disadvantages of decision trees:

  • 1 For data where the numbers of samples per class are unbalanced, the information gain in a decision tree is biased toward features with more distinct values;
  • 2 Easy to overfit;
  • 3 Ignore the correlation between attributes in the dataset.

2 Examples and source code analysis

2.1 Examples

The following example is used for classification.

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
// Train a DecisionTree model.
//  Empty categoricalFeaturesInfo indicates all features are continuous.
val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "gini"
val maxDepth = 5
val maxBins = 32
val model = DecisionTree.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  impurity, maxDepth, maxBins)
// Evaluate model on test instances and compute test error
val labelAndPreds = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testErr = labelAndPreds.filter(r => r._1 != r._2).count().toDouble / testData.count()
println("Test Error = " + testErr)
println("Learned classification tree model:\n" + model.toDebugString)

The following example is for regression.

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.model.DecisionTreeModel
import org.apache.spark.mllib.util.MLUtils
// Load and parse the data file.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
// Split the data into training and test sets (30% held out for testing)
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))
// Train a DecisionTree model.
//  Empty categoricalFeaturesInfo indicates all features are continuous.
val categoricalFeaturesInfo = Map[Int, Int]()
val impurity = "variance"
val maxDepth = 5
val maxBins = 32
val model = DecisionTree.trainRegressor(trainingData, categoricalFeaturesInfo, impurity,
  maxDepth, maxBins)
// Evaluate model on test instances and compute test error
val labelsAndPredictions = testData.map { point =>
  val prediction = model.predict(point.features)
  (point.label, prediction)
}
val testMSE = labelsAndPredictions.map{ case (v, p) => math.pow(v - p, 2) }.mean()
println("Test Mean Squared Error = " + testMSE)
println("Learned regression tree model:\n" + model.toDebugString)

2.2 Source code analysis

In MLlib, the decision tree and the random forest share the same implementation: when the number of trees in the forest is 1, the random forest implementation is exactly the decision tree implementation.

def run(input: RDD[LabeledPoint]): DecisionTreeModel = {
    // the number of trees is 1
    val rf = new RandomForest(strategy, numTrees = 1, featureSubsetStrategy = "all", seed = 0)
    val rfModel = rf.run(input)
    rfModel.trees(0)
  }

Here strategy is an instance of Strategy, which contains the following information:

/**
 * Stores all the configuration options for tree construction
 * @param algo  Learning goal.  Supported:
 *              [[org.apache.spark.mllib.tree.configuration.Algo.Classification]],
 *              [[org.apache.spark.mllib.tree.configuration.Algo.Regression]]
 * @param impurity Criterion used for information gain calculation.
 *                 Supported for Classification: [[org.apache.spark.mllib.tree.impurity.Gini]],
 *                  [[org.apache.spark.mllib.tree.impurity.Entropy]].
 *                 Supported for Regression: [[org.apache.spark.mllib.tree.impurity.Variance]].
 * @param maxDepth Maximum depth of the tree.
 *                 E.g., depth 0 means 1 leaf node; depth 1 means 1 internal node + 2 leaf nodes.
 * @param numClasses Number of classes for classification.
 *                                    (Ignored for regression.)
 *                                    Default value is 2 (binary classification).
 * @param maxBins Maximum number of bins used for discretizing continuous features and
 *                for choosing how to split on features at each node.
 *                More bins give higher granularity.
 * @param quantileCalculationStrategy Algorithm for calculating quantiles.  Supported:
 *                             [[org.apache.spark.mllib.tree.configuration.QuantileStrategy.Sort]]
 * @param categoricalFeaturesInfo A map storing information about the categorical variables and the
 *                                number of discrete values they take. For example, an entry (n ->
 *                                k) implies the feature n is categorical with k categories 0,
 *                                1, 2, ... , k-1. It's important to note that features are
 *                                zero-indexed.
 * @param minInstancesPerNode Minimum number of instances each child must have after split.
 *                            Default value is 1. If a split cause left or right child
 *                            to have less than minInstancesPerNode,
 *                            this split will not be considered as a valid split.
 * @param minInfoGain Minimum information gain a split must get. Default value is 0.0.
 *                    If a split has less information gain than minInfoGain,
 *                    this split will not be considered as a valid split.
 * @param maxMemoryInMB Maximum memory in MB allocated to histogram aggregation. Default value is
 *                      256 MB.
 * @param subsamplingRate Fraction of the training data used for learning decision tree.
 * @param useNodeIdCache If this is true, instead of passing trees to executors, the algorithm will
 *                      maintain a separate RDD of node Id cache for each row.
 * @param checkpointInterval How often to checkpoint when the node Id cache gets updated.
 *                           E.g. 10 means that the cache will get checkpointed every 10 updates. If
 *                           the checkpoint directory is not set in
 *                           [[org.apache.spark.SparkContext]], this setting is ignored.
 */
class Strategy @Since("1.3.0") (
    @Since("1.0.0") @BeanProperty var algo: Algo,//选择的算法,有分类和回归两种选择
    @Since("1.0.0") @BeanProperty var impurity: Impurity,//纯度有熵、基尼系数、方差三种选择
    @Since("1.0.0") @BeanProperty var maxDepth: Int,//树的最大深度
    @Since("1.2.0") @BeanProperty var numClasses: Int = 2,//分类数
    @Since("1.0.0") @BeanProperty var maxBins: Int = 32,//最大子树个数
    @Since("1.0.0") @BeanProperty var quantileCalculationStrategy: QuantileStrategy = Sort,
    //保存类别变量以及相应的离散值。一个entry (n ->k) 表示特征n属于k个类别,分别是0,1,...,k-1
    @Since("1.0.0") @BeanProperty var categoricalFeaturesInfo: Map[Int, Int] = Map[Int, Int](),
    @Since("1.2.0") @BeanProperty var minInstancesPerNode: Int = 1,
    @Since("1.2.0") @BeanProperty var minInfoGain: Double = 0.0,
    @Since("1.0.0") @BeanProperty var maxMemoryInMB: Int = 256,
    @Since("1.2.0") @BeanProperty var subsamplingRate: Double = 1,
    @Since("1.2.0") @BeanProperty var useNodeIdCache: Boolean = false,
    @Since("1.2.0") @BeanProperty var checkpointInterval: Int = 10) extends Serializable

The implementation of the decision tree itself is covered in the random forest classification topic. Here we only need to know that a random forest whose number of trees is 1 is a decision tree, and that in this case the tree is trained on all features rather than a randomly selected subset, i.e. featureSubsetStrategy = "all".
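
As a rough sketch of how this configuration can be driven directly (assuming the same sc and sample data file as in the earlier examples), a Strategy instance can be built explicitly and passed to DecisionTree.train, which internally runs the single-tree random forest shown above; the parameter values here are arbitrary examples:

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
import org.apache.spark.mllib.tree.impurity.Gini
import org.apache.spark.mllib.util.MLUtils

// Load the sample data set shipped with Spark, as in the earlier examples.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

// Build the configuration explicitly instead of using trainClassifier's positional arguments.
val strategy = new Strategy(
  algo = Algo.Classification,
  impurity = Gini,
  maxDepth = 5,
  numClasses = 2,
  maxBins = 32,
  categoricalFeaturesInfo = Map[Int, Int]())

val model = DecisionTree.train(data, strategy)
println(model.toDebugString)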

 
