Spark machine learning: decision tree principles (Part 1)

1. What is a decision tree
A decision tree is a tree structure (it may be a binary or a non-binary tree). Decision trees come in two types, classification trees and regression trees: a classification tree is a decision tree over discrete variables, while a regression tree is a decision tree over continuous variables.
  Each non-leaf node represents a test on a feature attribute, each branch represents the output of that test over a range of values, and each leaf node stores a category.
  Making a decision with a decision tree starts from the root node: the corresponding feature attribute of the item to be classified is tested, the outgoing branch is chosen according to its value, and the process repeats until a leaf node is reached; the category stored in that leaf node is the decision result.
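To make the decision process concrete, here is a minimal sketch (my own illustration, not from the original post) of a tree node and the root-to-leaf walk described above; the node fields and the tiny hair/voice tree are assumptions made up for the example.

```python
# A leaf node stores a category; an internal node stores the feature it tests
# and one child per possible outcome of that test.
class Node:
    def __init__(self, feature=None, children=None, category=None):
        self.feature = feature            # feature attribute tested at this node
        self.children = children or {}    # branch value -> child Node
        self.category = category          # set only on leaf nodes

def predict(node, sample):
    """Start from the root, test the sample's value for the node's feature,
    follow the matching branch, and return the category stored in the leaf."""
    while node.category is None:
        node = node.children[sample[node.feature]]
    return node.category

# Toy tree: test hair length first, then voice.
tree = Node(feature="hair", children={
    "long": Node(category="female"),
    "short": Node(feature="voice", children={
        "coarse": Node(category="male"),
        "fine": Node(category="female"),
    }),
})

print(predict(tree, {"hair": "short", "voice": "fine"}))  # -> female
```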
A decision tree learning algorithm consists of three parts:

1.1 Feature selection
  Feature selection means choosing, from the many features in the training data, the feature used to split the current node. There are many different quantitative criteria for evaluating this choice, and different criteria lead to different decision tree algorithms.
1.2 Decision tree generation
  Guided by the feature selection criterion, child nodes are generated recursively from top to bottom, and the tree stops growing once the data set can no longer be split. Recursion is the most natural way to build the tree structure, as the sketch below shows.
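As a sketch of that recursion (my own illustration, assuming discrete feature values and an externally supplied splitting criterion such as the information gain discussed in section 3):

```python
from collections import Counter

def build_tree(rows, labels, features, choose_best_feature):
    """Recursively generate child nodes top-down; stop when the data set can
    no longer be split and return the majority category as a leaf.

    rows:     list of dicts mapping feature name -> discrete value
    labels:   class label of each row
    features: feature names still available for splitting
    choose_best_feature: splitting criterion, e.g. maximum information gain
    """
    # Stop: every sample has the same label, or there is nothing left to test.
    if len(set(labels)) == 1 or not features:
        return Counter(labels).most_common(1)[0][0]      # leaf = majority class

    best = choose_best_feature(rows, labels, features)
    subtree = {best: {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        subtree[best][value] = build_tree(
            [rows[i] for i in idx],
            [labels[i] for i in idx],
            [f for f in features if f != best],
            choose_best_feature,
        )
    return subtree                                       # internal node as a nested dict
```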
1.3 Decision tree pruning
A decision tree is prone to over-fitting, so it generally needs to be pruned to shrink the tree structure and relieve the over-fitting. There are two kinds of pruning: pre-pruning and post-pruning.

2. An example
One day a teacher posed a question: how would you judge a student's gender from hair length and voice alone?
To answer it, the students quickly collected these two features for seven students; the data are as follows:

Student A thought: judge by hair length first and, if that is not decisive, judge by voice. He drew the following tree:


Student B preferred to judge by voice first and then by hair length; with a wave of his hand he drew a tree as well:

Whose decision tree is better, student A's or student B's?

3. Decision tree feature selection
We could divide the data set in many different ways, each with its own advantages and disadvantages. So we reason as follows: if we can measure the complexity of the data, and compare the complexity left after splitting on different features, then the feature that reduces the complexity the most is the best feature to split on.
Claude Shannon defined exactly such measures: entropy and information gain.
3.1 Entropy
First, consider the amount of information. The amount of information is a measure of information, just as the second is a measure of time. When we consider a discrete random variable x and observe a specific value of that variable, how much information do we receive?
The amount of information is tied to the probability of the random event. The smaller the probability of an event, the more information its occurrence carries (for example, the Chinese soccer team winning the World Cup); the larger the probability, the less information its occurrence carries (something certain to happen, such as the sun rising in the east, carries no information at all).
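This intuition is usually formalized as the self-information of an event (a standard formula, not written out in the original post): for an event x with probability p(x),

I(x) = -\log_2 p(x)

so the smaller p(x) is, the larger the amount of information I(x) carried by observing x.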

In probability theory and information theory, entropy is a measure of the uncertainty of a random variable.

For a sample data set X with n possible classification outcomes, where p(x_i) is the probability of the i-th class, the entropy is

H(X) = -\sum_{i=1}^{n} p(x_i) \log_2 p(x_i)

Example:

Among the 15 samples the outcome takes two values, lend or do not lend: 9 samples end in a loan and 6 do not. The entropy of the data set X is therefore

H(X) = -(9/15)\log_2(9/15) - (6/15)\log_2(6/15) \approx 0.971
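A quick way to check this number (my own sketch, not part of the original post):

```python
from math import log2

def entropy(counts):
    """Entropy of a class distribution given the raw class counts."""
    total = sum(counts)
    return -sum((c / total) * log2(c / total) for c in counts if c > 0)

print(round(entropy([9, 6]), 3))  # 9 "lend" vs 6 "do not lend" -> 0.971
```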

3.2 Information gain
As we said above, choosing a feature requires looking at its information gain. The larger a feature's information gain, the greater that feature's impact on the final classification result, so we should select the feature with the largest information gain as our splitting feature.
Before giving the definition of information gain, we need one more concept: conditional entropy.
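The original post does not reproduce the formula, so for reference, the standard definition used below is: if feature A splits the data set D into subsets D_1, ..., D_n, the conditional entropy is

H(D|A) = \sum_{i=1}^{n} \frac{|D_i|}{|D|} H(D_i)

that is, the entropy that remains in D, on average, once the value of A is known.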

Now for information gain itself. As mentioned earlier, information gain is defined with respect to a feature. The information gain g(D, A) of feature A on training data set D is defined as the difference between the entropy H(D) of the set D and the conditional entropy H(D|A) of D given feature A, namely:

g(D, A) = H(D) - H(D|A)
Take the loan application sample data table as an example, and look first at the age column, which is feature A1. It takes three values: youth, middle age and old age. There are 5 samples whose age is youth, so the probability that a sample in the training set is a youth is 5/15, i.e. one third. Likewise, the probabilities that a sample is middle-aged or elderly are each one third. Among the 5 youth samples only 2 finally got the loan, so the probability that a youth sample got the loan is 2/5; similarly, for the middle-aged and elderly samples the probabilities of getting the loan are 3/5 and 4/5 respectively. The information gain of the age feature is therefore calculated as follows:
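Filling in the numbers stated above (the original shows this calculation as an image; the arithmetic below follows directly from those proportions), with D_1, D_2, D_3 the youth, middle-aged and elderly subsets:

H(D_1) = -(2/5)\log_2(2/5) - (3/5)\log_2(3/5) \approx 0.971
H(D_2) = -(3/5)\log_2(3/5) - (2/5)\log_2(2/5) \approx 0.971
H(D_3) = -(4/5)\log_2(4/5) - (1/5)\log_2(1/5) \approx 0.722

H(D|A_1) = (1/3)(0.971) + (1/3)(0.971) + (1/3)(0.722) \approx 0.888

g(D, A_1) = H(D) - H(D|A_1) \approx 0.971 - 0.888 = 0.083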

Similarly, the information gains g(D, A2), g(D, A3) and g(D, A4) of the remaining features are calculated. They are as follows:
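The post shows these remaining calculations as images; as a programmatic counterpart (my own sketch, with made-up helper names), each gain could be computed like this:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(D): entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def information_gain(rows, labels, feature):
    """g(D, A) = H(D) - H(D|A) for one discrete feature.

    rows:   list of dicts mapping feature name -> value
    labels: class label of each row
    """
    total = len(labels)
    conditional = 0.0
    for value in set(row[feature] for row in rows):
        subset = [labels[i] for i, row in enumerate(rows) if row[feature] == value]
        conditional += len(subset) / total * entropy(subset)   # |Di|/|D| * H(Di)
    return entropy(labels) - conditional
```

A criterion that picks the feature maximizing `information_gain` could serve as the `choose_best_feature` argument in the generation sketch of section 1.2.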


Finally, comparing the information gains of all the features: feature A3 (has own house) has the largest information gain, so A3 is selected as the optimal feature.
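Since this series is about Spark machine learning, here is a minimal sketch of training a decision tree with spark.ml, using entropy as the impurity measure so that splits follow the information-gain idea above; the tiny in-memory data set and column names are made up for illustration.

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import DecisionTreeClassifier

spark = SparkSession.builder.appName("decision-tree-demo").getOrCreate()

# Toy data: (has_house, has_job, label), label 1.0 = loan granted.
df = spark.createDataFrame(
    [(1.0, 0.0, 1.0), (0.0, 1.0, 1.0), (0.0, 0.0, 0.0),
     (1.0, 1.0, 1.0), (0.0, 0.0, 0.0)],
    ["has_house", "has_job", "label"],
)

# spark.ml expects a single vector column of features.
assembler = VectorAssembler(inputCols=["has_house", "has_job"], outputCol="features")
train = assembler.transform(df)

# impurity="entropy" uses the information-gain style criterion;
# maxDepth acts as a simple form of pre-pruning.
dt = DecisionTreeClassifier(labelCol="label", featuresCol="features",
                            impurity="entropy", maxDepth=3)
model = dt.fit(train)

print(model.toDebugString)   # the learned tree structure
spark.stop()
```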

 

Source: www.cnblogs.com/xiguage119/p/11015677.html