About the Decision Tree and Naive Bayes Algorithms

This section introduces two common classification algorithms in data mining: the decision tree and Naive Bayes.

Decision Tree Algorithm

The decision tree (DT) is a simple and widely used classification technique.

A decision tree is a tree-shaped predictive model composed of nodes and directed edges arranged in a hierarchical structure. A tree contains three kinds of nodes: the root node, internal nodes, and leaf nodes. A tree has exactly one root node, which corresponds to the full set of training data.

An internal node represents a test on one feature attribute, and each branch leaving it corresponds to a range of values of that attribute. A leaf node stores a category, that is, the class label assigned to every instance that reaches it.

1. Decision Tree Example

Classifying with a decision tree is a decision-making process: starting at the root node, the feature tested at the current node is evaluated for the instance, the branch matching that feature's value is followed, and this is repeated until a leaf node is reached. The category stored in that leaf node is the decision result.

Figure 1 shows a decision tree for predicting whether a person will buy a computer; the tree can be used to classify new records. Starting at the root node (age): if the person is middle-aged, the tree immediately predicts that the person will buy a computer; if the person is young, it further checks whether the person is a student; if the person is older, it further checks the person's credit rating.

Figure 1: A decision tree for predicting whether a customer will buy a computer

Suppose customer A has the following four attribute values: age 20, low income, is a student, and has fair credit. Following the tree from the root, the age test classifies customer A as young, so the left branch is taken; next, the student test finds that customer A is a student, so the right branch is taken; customer A therefore ends up at a "yes" leaf node. The tree thus predicts that customer A will buy a computer.
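To make this decision process concrete, here is a minimal Scala sketch of such a tree and its traversal. It is not the exact tree of Figure 1; the Node types, attribute names, and branch values are assumptions made purely for illustration.

// A hand-built decision tree for the "buys computer" example (hypothetical structure).
sealed trait Node
case class Leaf(label: String) extends Node
case class Internal(attribute: String, children: Map[String, Node]) extends Node

object BuyComputerTree {
  val tree: Node = Internal("age", Map(
    "young"       -> Internal("student", Map("no" -> Leaf("no"), "yes" -> Leaf("yes"))),
    "middle-aged" -> Leaf("yes"),
    "senior"      -> Internal("credit", Map("fair" -> Leaf("yes"), "excellent" -> Leaf("no")))
  ))

  // Walk from the root to a leaf, following the branch that matches each attribute value.
  def classify(node: Node, record: Map[String, String]): String = node match {
    case Leaf(label)                   => label
    case Internal(attribute, children) => classify(children(record(attribute)), record)
  }

  def main(args: Array[String]): Unit = {
    val customerA = Map("age" -> "young", "student" -> "yes", "credit" -> "fair")
    println(classify(tree, customerA))   // follows the path described above and prints "yes"
  }
}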

2. Building the Tree

There are many decision tree algorithms, such as ID3, C4.5, and CART. These algorithms all build the decision tree top-down with a greedy strategy: at each internal node the best attribute is selected to split the node, the data are divided into two or more child nodes according to that attribute, and the process continues until the tree can classify all the training data correctly or all attributes have been used.

1) Feature Selection

When building a decision tree with the greedy algorithm, the first task is feature selection, that is, deciding which attribute to test at a node. Choosing a suitable feature for each node speeds up classification and reduces the depth of the decision tree.

The goal of feature selection is to make the data subsets produced by a split as pure as possible. How can the purity of a data set be measured? Here we need to introduce a purity measure: information gain.

Information is a rather abstract concept. We often say that something carries a lot of information or only a little, but it is hard to state exactly how much information it contains.

In 1948 Shannon, the father of information theory, proposed the concept of "information entropy", which finally made it possible to measure information quantitatively. Informally, entropy describes the uncertainty associated with the possible outcomes of some information. When the probabilities of the different outcomes are evenly distributed, the uncertainty is greatest and the entropy reaches its maximum. Conversely, when the probability of one particular outcome is far greater than that of the others, the uncertainty is small and the entropy is very low.

Therefore, when building a decision tree, we want to choose features that make the information entropy of the resulting data subsets as small as possible, that is, that reduce the uncertainty as much as possible. When a feature is used to split the data set, the entropy after the split is smaller than before the split; this difference is the information gain. Information gain thus measures how much a feature contributes to the classification result.

The ID3 algorithm uses information gain as its attribute selection measure: for every feature that could become the next tree node, the information gain obtained by splitting on that feature is computed, and the feature with the maximum information gain is selected as the next tree node.
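To make the calculation concrete, the following sketch shows how entropy and information gain could be computed for a labeled data set. The representation of a sample as a (featureValues, label) pair and the function names are assumptions made here, not part of any library.

import scala.math.log

object InformationGain {
  // Entropy of a collection of class labels: -sum(p * log2(p)).
  def entropy(labels: Seq[String]): Double = {
    val total = labels.size.toDouble
    labels.groupBy(identity).values.map { group =>
      val p = group.size / total
      -p * (log(p) / log(2.0))
    }.sum
  }

  // Information gain of splitting the samples on the feature at featureIndex.
  // Each sample is (featureValues, label).
  def informationGain(samples: Seq[(Seq[String], String)], featureIndex: Int): Double = {
    val total = samples.size.toDouble
    val baseEntropy = entropy(samples.map(_._2))
    val splitEntropy = samples.groupBy(_._1(featureIndex)).values.map { subset =>
      (subset.size / total) * entropy(subset.map(_._2))
    }.sum
    baseEntropy - splitEntropy   // entropy before the split minus weighted entropy after it
  }
}

ID3 would evaluate informationGain for every remaining candidate feature at a node and split on the feature with the largest value.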

2) Pruning

When building a classification model, over-fitting can easily occur. Over-fitting means that during training the model approximates the training samples with very high accuracy, while the error on the test samples first falls and then rises again as training continues. With over-fitting, the training error is small but the test error is large, which hurts practical use.

Over-fitting in a decision tree can be mitigated by pruning. Pruning comes in two forms: pre-pruning and post-pruning.

Pre-pruning imposes constraints during tree growth so that the tree stops growing before it fully fits the training data.

There are many criteria for pre-pruning; for example, tree growth can be stopped when the information gain of the best split falls below a predefined threshold. Choosing an appropriate threshold requires some care, however: a threshold that is too high leads to an under-fitted model, while a threshold that is too low leads to over-fitting.

Post-pruning prunes the fully grown decision tree bottom-up after construction is complete. There are two common post-pruning approaches: one replaces a subtree with a new leaf node whose class is the majority class of the data covered by that subtree; the other replaces a subtree with its most frequently used branch.

Pre-pruning may terminate tree growth too early, and post-pruning generally produces better results; however, post-pruning discards subtrees that have already been grown, so part of the computation spent on growing the tree is wasted.
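To illustrate the pre-pruning criterion described above, the following hypothetical sketch stops splitting when the best information gain falls below a threshold. It reuses the Node, Leaf, and Internal types and the InformationGain helper from the earlier sketches; majorityLabel is another assumed helper defined here.

object PrePruning {
  // Majority class of a set of samples (assumed helper).
  def majorityLabel(samples: Seq[(Seq[String], String)]): String =
    samples.map(_._2).groupBy(identity).maxBy(_._2.size)._1

  // Recursively grow a tree, stopping early (pre-pruning) when the best gain is below minInfoGain.
  def buildTree(samples: Seq[(Seq[String], String)],
                features: Seq[Int],
                minInfoGain: Double): Node = {
    if (features.isEmpty) {
      Leaf(majorityLabel(samples))
    } else {
      // Pick the remaining feature with the highest information gain.
      val (bestFeature, bestGain) =
        features.map(f => (f, InformationGain.informationGain(samples, f))).maxBy(_._2)
      if (bestGain < minInfoGain) {
        Leaf(majorityLabel(samples))        // pre-pruning: the split is not worth making
      } else {
        Internal(bestFeature.toString, samples.groupBy(_._1(bestFeature)).map {
          case (value, subset) =>
            value -> buildTree(subset, features.filterNot(_ == bestFeature), minInfoGain)
        })
      }
    }
  }
}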

3. Spark MLlib Decision Tree Algorithm

The Spark MLlib decision tree supports both continuous and discrete feature variables; that is, it supports both regression (prediction) and classification.

In Spark MLlib, the decision tree selects split features according to information gain and uses pre-pruning to prevent over-fitting. When any one of the following conditions is met, Spark MLlib stops splitting the node and turns it into a leaf node.

  • The tree reaches the specified maximum depth maxDepth.
  • No candidate split of the current node yields an information gain above the specified threshold minInfoGain.
  • Splitting the node would produce a child node whose number of samples is smaller than the threshold minInstancesPerNode.

The Spark MLlib decision tree algorithm is implemented by the DecisionTree class, which supports binary and multiclass classification as well as regression. The user configures the algorithm through the Strategy parameters, specifying whether to classify or to perform regression and which method to use.
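As a hedged sketch of that Strategy-based configuration, assuming the Strategy class of Spark 1.x MLlib (whose fields such as minInfoGain and minInstancesPerNode can be assigned directly) and an existing SparkContext named sc, training could also be driven as follows.

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.{Algo, Strategy}
import org.apache.spark.mllib.util.MLUtils

// Assumes sc is an existing SparkContext and dt.data is the LIBSVM file used later in this section.
val data = MLUtils.loadLibSVMFile(sc, "/home/hadoop/exercise/dt.data")

// Start from a default classification strategy and adjust the stopping (pre-pruning) controls.
val strategy = Strategy.defaultStrategy(Algo.Classification)
strategy.maxDepth = 5              // stop when the tree reaches this depth
strategy.numClasses = 2
strategy.minInfoGain = 0.0         // smallest information gain a split may have
strategy.minInstancesPerNode = 1   // smallest number of samples a child node may hold

val model = DecisionTree.train(data, strategy)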

1) The Training Function of Spark MLlib DecisionTree

DecisionTree calls the trainClassifier method for classification training; its parameters are shown below.

def trainClassifier(
    input: RDD[LabeledPoint],
    numClasses: Int,
    categoricalFeaturesInfo: Map[Int, Int],
    impurity: String,
    maxDepth: Int,
    maxBins: Int): DecisionTreeModel

The training function returns a decision tree model. The meaning of each parameter is as follows.

name                      Explanation
input                     The input data set; each element of the RDD is a data point containing a label and a feature vector. For classification, the label takes values in {0, 1, ..., numClasses-1}.
numClasses                The number of classes; the default value is 2.
categoricalFeaturesInfo   A map recording which features are categorical; for example, (5 -> 4) means the fifth feature of each data point is categorical with four values {0, 1, 2, 3}.
impurity                  The impurity measure used to evaluate splits: information entropy ("entropy") or the Gini index ("gini").
maxDepth                  The maximum depth of the tree.
maxBins                   The maximum number of bins used when splitting a feature at a node, that is, the maximum number of split candidates per feature.

2) The Prediction Function of Spark MLlib DecisionTree

The DecisionTreeModel.predict method accepts input data in different formats, including a single vector or an RDD of vectors, and returns the computed predicted value(s). The method's API is as follows.

def predict(features: Vector): Double
def predict(features: RDD[Vector]): RDD[Double]

The first form predicts for a single data point: its input parameter is the feature vector describing that data point, and it returns the predicted value for that point. The second form accepts a set of data points: its input parameter is an RDD whose elements are the feature vectors of the data points, and it returns the predicted value for each data point as an RDD.
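For example, assuming model is a DecisionTreeModel trained as in the example below and data is the RDD[LabeledPoint] it was trained on (both assumptions here), the two forms could be used like this.

// Predict a single data point with the first overload.
val firstPoint = data.first()
val singlePrediction = model.predict(firstPoint.features)

// Predict all points at once with the second overload (RDD in, RDD out).
val allPredictions = model.predict(data.map(_.features))
allPredictions.take(5).foreach(println)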

4. Spark MLlib Decision Tree Algorithm Example

Example: import a training data set, build a decision tree classification model with the ID3 algorithm (using information gain as the purity measure for selecting split features), and finally use the constructed tree to classify two data samples.

The data used in this example are stored in the file dt.data, which provides the feature data and corresponding labels of six data points, as shown below.

1 1:1 2:0 3:0 4:1
0 1:1 2:0 3:1 4:1
0 1:0 2:1 3:0 4:0
1 1:1 2:1 3:0 4:0
1 1:1 2:0 3:0 4:0
1 1:1 2:1 3:0 4:0

Each row of the file is one data sample: the first column is its label, and the following four columns are the sample's four feature values in key:value format.

The implementation code is shown below.

import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.{SparkConf, SparkContext}

object DecisionTreeByEntropy {
  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[4]").setAppName("DecisionTreeByEntropy")
    val sc = new SparkContext(conf)

    // Load the data (LIBSVM format)
    val data = MLUtils.loadLibSVMFile(sc, "/home/hadoop/exercise/dt.data")
    val numClasses = 2                              // number of classes
    val categoricalFeaturesInfo = Map[Int, Int]()   // empty map: treat all features as continuous
    val impurity = "entropy"                        // use information entropy to compute gain
    val maxDepth = 5                                // maximum depth of the tree
    val maxBins = 3                                 // maximum number of bins per feature

    // Build the model and print its structure
    val model = DecisionTree.trainClassifier(data, numClasses, categoricalFeaturesInfo,
      impurity, maxDepth, maxBins)
    println("model.depth:" + model.depth)
    println("model.numNodes:" + model.numNodes)
    println("model.topNode:" + model.topNode)

    // Take two samples from the data set, predict them, and print the results
    val labelAndPreds = data.take(2).map { point =>
      val prediction = model.predict(point.features)
      (point.label, prediction)
    }
    labelAndPreds.foreach(println)
    sc.stop()
  }
}

Running the above code prints information about the constructed classification tree, including the depth of the tree, the number of nodes, the details of the root node, and the actual and predicted values of the two samples. The output is as follows.

model.depth:2
model.numNodes:5
model.topNode:id = 1, isLeaf = false, predict = 1.0 (prob = 0.6666666666666666),
impurity = 0.9182958340544896, split = Some(Feature = 0, threshold = 0.0, featureType = Continuous, categories = List()),
stats = Some(gain = 0.31668908831502096, impurity = 0.9182958340544896,
left impurity = 0.0, right impurity = 0.7219280948873623)
(1.0,1.0)
(0.0,0.0)

5. Advantages and Disadvantages of the Algorithm

The decision tree classification algorithm is very popular. In general, it does not require any domain knowledge or parameter tuning, and it can handle high-dimensional data. The knowledge it represents is intuitive and descriptive, very easy to understand, and helpful for manual analysis. Learning and classification with a decision tree are simple and efficient. A decision tree only needs to be constructed once and can then be used repeatedly, and the number of computations per prediction never exceeds the depth of the tree.

In general, decision trees achieve good classification accuracy, but their successful application may depend on the data being modeled.

Naive Bayes Algorithm

Naive Bayes is a very simple classification algorithm. Its underlying idea is: for an item to be classified, compute the probability of each category given that the item's features appear, and assign the item to the category with the largest such probability.

1. The Bayes Formula

The core of the Naive Bayes classification algorithm is the Bayes formula: P(B|A) = P(A|B)P(B)/P(A). Another form makes it clearer: P(category|features) = P(features|category)P(category)/P(features).

Suppose X is a data tuple to be classified, described by n attributes, and H is a hypothesis, for example that X belongs to class C. For the classification problem, we want to compute P(H|X), the probability that hypothesis H holds given the observed attribute values of tuple X.

For example, suppose X has the attribute values age = 25 and income = $5,000, and H is the hypothesis that X will buy a computer.

  • P(H|X): the probability that the customer will buy a computer, given that the customer's age is 25 and income is $5,000.
  • P(H): the probability that any given customer will buy a computer, regardless of other information.
  • P(X|H): the probability that a customer's attribute values are age = 25 and income = $5,000, given that the customer buys a computer.
  • P(X): the probability that, over all customers, a customer's attribute values are age = 25 and income = $5,000.

2. How It Works

1) Let D be the training set. Each sample X consists of n attribute values, i.e., X = (x1, x2, ..., xn), with the corresponding attribute set A1, A2, ..., An.

2) Suppose there are m class labels, C1, C2, ..., Cm. For a tuple X to be classified, the naive Bayes classifier assigns X the class label Ci for which P(Ci|X) (i = 1, 2, ..., m) is largest. The goal is therefore to find the maximum of P(Ci|X), where P(Ci|X) = P(X|Ci)P(Ci)/P(X).

3) If n is large, that is, the sample tuples have many attributes, computing P(X|Ci) directly is quite complex. The Naive Bayes algorithm therefore makes an assumption: the attributes of a sample tuple are conditionally independent of one another given the class, so P(X|Ci) = P(x1|Ci)P(x2|Ci)...P(xn|Ci). Each of these probabilities can be estimated from the training set; for example, P(x1|Ci) equals the number of training samples in class Ci whose attribute A1 takes value x1, divided by the number of training samples in class Ci.

4) To predict the class label of X, P(X|Ci)P(Ci) is computed for every class label Ci as described above. X is assigned the class label Ci if, for every j (1 ≤ j ≤ m, j ≠ i), P(X|Ci)P(Ci) > P(X|Cj)P(Cj).
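The following small sketch implements this counting procedure for categorical attributes. The (attributeValues, label) representation and the function names are assumptions made here for illustration, and no smoothing is applied.

object NaiveBayesSketch {
  // Each training sample is (attributeValues, classLabel); all attributes are categorical.
  // Returns the class Ci that maximizes P(Ci) * P(x1|Ci) * ... * P(xn|Ci).
  def classify(training: Seq[(Seq[String], String)], x: Seq[String]): String = {
    val total = training.size.toDouble
    val byClass = training.groupBy { case (_, label) => label }
    val scores = byClass.map { case (label, samples) =>
      val prior = samples.size / total                          // P(Ci)
      val likelihood = x.zipWithIndex.map { case (value, i) =>  // P(xi | Ci) for each attribute
        samples.count { case (attrs, _) => attrs(i) == value } / samples.size.toDouble
      }.product
      label -> prior * likelihood
    }
    scores.maxBy { case (_, score) => score }._1                // class with the largest score
  }
}

With the computer-purchase data of Table 1 below as training, classify(trainingData, Seq("<=30", "medium", "yes", "fair")) should return "yes"; this is the same query that the Spark MLlib example later in this section predicts.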

3. Spark MLlib Naive Bayes Algorithm

The Spark MLlib Naive Bayes implementation mainly computes the prior probability of each category and the conditional probability of each feature attribute given each category. The distributed part of the implementation is an aggregation over the samples that counts the occurrences of every label and of the features associated with it.

After the aggregation, the prior and conditional probabilities can be computed from the aggregated counts, yielding the Naive Bayes classification model. At prediction time, the model's prior and conditional probabilities are used to compute, for each sample, the probability of belonging to each category, and the sample is finally assigned to the category with the largest probability.

Spark MLlib supports Multinomial Naive Bayes and Bernoulli Naive Bayes. Multinomial Naive Bayes is mainly used for topic classification of text and takes the number of occurrences of each word into account (word-frequency analysis). Bernoulli Naive Bayes ignores word frequency and only considers whether a word appears or not; it is mainly used for text sentiment analysis. Which model to use can be selected through the algorithm's modelType parameter.

The Spark MLlib Naive Bayes classifier is trained by calling the train method, whose parameters are shown below.

def train(
    input: RDD[LabeledPoint],
    lambda: Double,
    modelType: String): NaiveBayesModel

The training function returns a Naive Bayes model. The meaning of each parameter is as follows.

  • input: the input data set; each element of the RDD is a data point containing a label and a feature vector. For classification, the label takes values in {0, 1, ..., numClasses-1}.
  • lambda: the additive smoothing parameter; the default value is 1.0.
  • modelType: specifies whether to use the Multinomial Naive Bayes or the Bernoulli Naive Bayes model; the default is Multinomial Naive Bayes (see the sketch below).
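For example, assuming trainingData is an RDD[LabeledPoint] that has already been prepared (an assumption here), the two model types could be selected like this.

import org.apache.spark.mllib.classification.NaiveBayes

// Multinomial Naive Bayes (the default), with additive smoothing lambda = 1.0
val multinomialModel = NaiveBayes.train(trainingData, lambda = 1.0, modelType = "multinomial")

// Bernoulli Naive Bayes, which only considers whether a feature appears (features must be 0/1)
val bernoulliModel = NaiveBayes.train(trainingData, lambda = 1.0, modelType = "bernoulli")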

The prediction method NaiveBayesModel.predict works like the prediction function of Spark MLlib's DecisionTree: it accepts input in different formats, including a single vector or an RDD of vectors, and returns the computed predicted value(s).

4. Spark MLlib Naive Bayes Example

Using the computer-purchase sample data in Table 1 as the training set, build a classification model with Multinomial Naive Bayes, and then use the constructed model to classify one data sample.

Table 1: Computer-purchase sample data
age income student credit_rating buys_computer
≤30 high no fair no
≤30 high no excellent no
31~40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31~40 low yes excellent yes
≤30 medium no fair no
≤30 low yes fair yes
>40 medium yes fair yes
≤30 medium yes excellent yes
31~40 medium no excellent yes
31~40 high yes fair yes
>40 medium no excellent no

The data used in this example are stored in the file sample_computer.data. Each line of the file is one data sample: the first column is its label, and the following four columns are the sample's four feature values. The label and the feature values are separated by a comma, and the feature values are separated by spaces, as shown below.

buys_computer,age income student credit_rating

Here, buys_computer takes the value 0 for no and 1 for yes; age takes the value 0 for ≤30, 1 for 31~40, and 2 for >40; income takes the value 0 for low, 1 for medium, and 2 for high; student takes the value 0 for no and 1 for yes; and credit_rating takes the value 0 for fair and 1 for excellent.

With this encoding, the first three rows of Table 1 correspond to the following data lines.

0,0 2 0 0
0,0 2 0 1
1,1 2 0 0

The implementation code is shown below:

import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.{SparkConf, SparkContext}

object NaiveBayesExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local").setAppName("NaiveBayes")
    val sc = new SparkContext(conf)
    val path = "../data/sample_computer.data"
    val data = sc.textFile(path)
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }
    // Split the samples: 60% for training, 40% for testing
    val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0)
    val test = splits(1)
    // Train the model; the first argument is the data, the second the smoothing parameter (default 1.0)
    val model = NaiveBayes.train(training, lambda = 1.0)
    // Predict the test samples
    val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
    // Measure the model's accuracy
    val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()
    // Print the accuracy and one prediction
    println("NaiveBayes accuracy --> " + accuracy)
    println("If age<=30, income=medium, student=yes, credit_rating=fair, buys computer? " +
      model.predict(Vectors.dense(0.0, 1.0, 1.0, 0.0)))
    // Save the model
    val modelPath = "../model/NaiveBayes_model.obj"
    model.save(sc, modelPath)
    sc.stop()
  }
}

5. Advantages and Disadvantages of the Algorithm

The main advantages of the Naive Bayes algorithm are that its logic is simple and easy to implement, and that the time and space overhead of classification is small, involving only two-dimensional storage.

In theory, the Naive Bayes algorithm has the smallest error rate compared with other classification methods. In practice this is not always the case, because the Naive Bayes model assumes that attributes are independent of one another, an assumption that often does not hold in real applications. When the number of attributes is large or the correlation between attributes is strong, classification performance suffers; when the correlation between attributes is small, Naive Bayes performs at its best.

