Machine Learning Primer (XI): Decision Tree - a Model That Can Do Both Classification and Regression

Decision Tree

Earlier we covered the linear regression and naive Bayes models. The former can only do regression, and the latter can only do classification. The decision tree model discussed in this article, however, can be used for both classification and regression.

What is a decision tree

A decision tree is a very basic and common machine learning model.

A decision tree (Decision Tree) is a tree structure (it may be a binary or a non-binary tree). Each non-leaf node corresponds to a feature, each branch leaving that node represents one value of the feature, and each leaf node stores a category or a regression function.

To use a decision tree, start from the root: take the value of the corresponding feature from the item to be classified, follow the branch that matches that value, and keep going downward until a leaf node is reached. The category or regression function stored at that leaf is then output as the decision result.

The decision-making process of a decision tree is very intuitive, easy to understand, and computationally cheap, which makes it very important in machine learning. If you were to list the "top ten machine learning models", the decision tree would probably rank in the top three.

An intuitive understanding of the decision tree

Below is an example of a decision tree:

[Figure: an example decision tree for deciding whether to accept a job offer]

The job of this tree is to decide whether or not to accept an offer.

The tree has seven nodes in total: four leaf nodes and three non-leaf nodes. It is a classification tree, and each leaf node corresponds to a category.

Does having four leaf nodes mean there are four categories? Of course not! As the figure shows, there are only two categories: accept the offer and decline the offer.

Theoretically, a classification tree with n leaf nodes (n > 1, since a tree with only one possible result would not classify anything) may correspond to anywhere from 2 to n categories, because different decision paths can lead to the same result.

In the example above, to judge an offer we check three conditions: (1) annual salary; (2) commute time; (3) free coffee.

These three conditions clearly differ in importance. The most important one sits at the root node, and the closer a condition is to the root, the more important it is: if the annual salary is below $50,000, we say no right away without considering anything else; if the salary is sufficient but the commute takes more than an hour, we still will not work there; and even if the commute is under an hour, we decline if there is no free coffee.

These three non-leaf nodes (including the root) are collectively called decision nodes. Each decision node corresponds to a judgment condition, and that condition is what we call a feature. The classification tree in the example therefore has three features.

When we use this tree to judge an offer, we extract the values of its three features, namely annual salary, commute time, and whether there is free coffee (for example: [$65,000, 0.5 hours, no free coffee]), and feed them into the tree.

The tree then filters the input through its conditions from the root downward until a leaf is reached. The category stored at that leaf is the prediction.
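To make the walk-through concrete, here is a minimal sketch of the example tree as plain Python if/else logic. The function name and the exact cutoffs ($50,000 salary, one-hour commute, free coffee) are taken from the figure above and are purely illustrative.

```python
def judge_offer(annual_salary, commute_hours, free_coffee):
    """Walk the example tree from the root down and return a decision."""
    if annual_salary < 50_000:     # root decision node: annual salary
        return "decline offer"
    if commute_hours > 1:          # second decision node: commute time
        return "decline offer"
    if not free_coffee:            # third decision node: free coffee
        return "decline offer"
    return "accept offer"

# The sample input from the text: [$65,000, 0.5 hours, no free coffee]
print(judge_offer(65_000, 0.5, False))  # -> decline offer
```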

Building decision trees

Using a decision tree is very simple. But how do we construct one in the first place?

We said earlier that the process of obtaining a model is called training. So how do we train a decision tree?

Briefly, the steps are as follows:

  1. Prepare a number of training samples (suppose there are m of them);
  2. Label each sample with its expected category;
  3. Manually select some features (i.e., decision conditions);
  4. For every training sample, produce the value of each selected feature, i.e., the feature values (a tiny hypothetical example of such data is sketched after this list);
  5. Feed the training data obtained in steps 1-4 into a training algorithm. Following certain principles, the algorithm determines how important each feature is and then builds the decision tree from the most important feature down to the least important one.
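As a concrete (and entirely made-up) illustration of steps 1-4, the prepared training data could look like this in Python, with one dictionary of feature values per sample and a parallel list of expected categories:

```python
# Hypothetical training samples for the offer example (step 4: feature values)
samples = [
    {"salary": 65_000, "commute_hours": 0.5, "free_coffee": False},
    {"salary": 80_000, "commute_hours": 0.4, "free_coffee": True},
    {"salary": 45_000, "commute_hours": 0.2, "free_coffee": True},
    {"salary": 70_000, "commute_hours": 1.5, "free_coffee": True},
]
# Step 2: the expected category of each sample
labels = ["decline", "accept", "decline", "decline"]
```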

So what exactly is the training algorithm? And by what principle does it decide how important each feature is?

Several commonly used algorithms

Constructing a decision tree is an iterative process. In each iteration, some feature is used as a split point to divide the sample data into subsets. The feature used as the split point is called the split feature.

The goal of choosing a split feature is to make each resulting subset as "pure" as possible, that is, to make the samples in each subset belong to the same category as far as possible.

There are several algorithms for choosing splits that keep each subset "pure". Let's look at a few of them.

ID3 algorithm

Let's start with the most direct and simplest one, the ID3 algorithm (Iterative Dichotomiser 3).

Its core idea: measure splits by information gain, and select as the split feature the one whose split yields the greatest information gain.

First we need to understand a concept: information entropy.

Suppose a random variable x can take n values {x1, x2, ..., xn}, with probabilities {p1, p2, ..., pn} respectively. Then the entropy of x is defined as:

$$\mathrm{Entropy}(x) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

Entropy measures the degree of disorder in information: the more chaotic the information, the larger the entropy.

Let S be the set of all samples, and suppose the samples fall into n classes in total. Then:

$$\mathrm{Entropy}(S) = -\sum_{i=1}^{n} p_i \log_2(p_i)$$

where p_i is the probability that a sample belongs to the i-th class, i.e., the proportion of class-i samples in the whole sample set.
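A minimal sketch of this calculation in Python; the function name and the choice of a plain label list as input are my own:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: -sum(p_i * log2(p_i))."""
    total = len(labels)
    return -sum((c / total) * log2(c / total)
                for c in Counter(labels).values())

print(entropy(["accept", "decline", "decline", "decline"]))  # ~0.811
```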

The next concept to understand is information gain. The formula below expresses the information gain obtained by splitting the sample set S on the feature T:

$$\mathrm{InformationGain}(T) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{value}(T)} \frac{|S_v|}{|S|} \mathrm{Entropy}(S_v)$$

where:

  • S is the entire sample set and |S| is the number of samples in it;
  • T is a feature of the samples;
  • value(T) is the set of all values that T can take;
  • v is one value of the feature T;
  • S_v is the subset of S in which the feature T takes the value v, and |S_v| is the number of samples in that subset.
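Building on the entropy function above, here is a sketch of the information gain of splitting on a single feature. The data layout, a list of (feature_value, label) pairs, is just for illustration:

```python
from collections import defaultdict

def information_gain(pairs):
    """Gain(T) = Entropy(S) - sum over v of |S_v|/|S| * Entropy(S_v).

    `pairs` is a list of (feature_value, label) tuples for one feature T.
    """
    all_labels = [label for _, label in pairs]
    subsets = defaultdict(list)            # S_v: labels grouped by value v
    for value, label in pairs:
        subsets[value].append(label)
    weighted = sum(len(sub) / len(pairs) * entropy(sub)
                   for sub in subsets.values())
    return entropy(all_labels) - weighted

# Splitting the toy samples on "free_coffee":
print(information_gain([(False, "decline"), (True, "accept"),
                        (True, "decline"), (True, "decline")]))  # ~0.123
```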

C4.5

ID3, described above, is only the simplest decision tree algorithm.

Its use of information gain as the selection measure, although intuitive, has a big drawback: ID3 tends to prefer features with many distinct values as split features.

That is because features with many values usually have relatively large information gain: information gain reflects how much uncertainty is removed under a given condition, and a more finely divided data set necessarily has higher certainty.

A feature with more values splits the data into finer subsets, and the finer the split result, the greater the information gain.

To avoid this shortcoming, an improved version of ID3 was born: the C4.5 algorithm.

C4.5 uses the information gain ratio (Gain Ratio), rather than the raw information gain, as the criterion for selecting the branching feature.

The gain ratio penalizes features with many possible values by introducing a split information (Split Information) term:

$$\mathrm{SplitInformation}(T) = -\sum_{v \in \mathrm{value}(T)} \frac{|S_v|}{|S|} \log_2 \frac{|S_v|}{|S|}$$

$$\mathrm{GainRatio}(T) = \frac{\mathrm{InformationGain}(T)}{\mathrm{SplitInformation}(T)}$$
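Continuing the sketch and reusing `entropy`, `information_gain`, `Counter`, and `log2` from above, the C4.5 criterion could be computed like this:

```python
def gain_ratio(pairs):
    """C4.5 gain ratio: InformationGain(T) / SplitInformation(T)."""
    counts = Counter(value for value, _ in pairs)
    total = len(pairs)
    split_info = -sum((c / total) * log2(c / total) for c in counts.values())
    if split_info == 0:      # the feature has only one value: useless split
        return 0.0
    return information_gain(pairs) / split_info
```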

ID3 has another problem: it cannot handle features with continuous value ranges. For example, suppose one of the features in the training samples were age, a real number in the range (0, 100). ID3 would not know what to do with it.

C4.5 makes up for this as well. The specific approach is as follows (a small code sketch follows the list).

  • Sort the samples to be processed (the whole set for the tree, or a subset for a subtree) in ascending order of the continuous feature.
  • Suppose the m samples actually take k distinct values (k <= m) on this feature. Then there are k-1 candidate split thresholds in total, each being the midpoint between two adjacent values in the sorted order. These k-1 split points convert the originally continuous feature into k-1 Boolean features.
  • Among these k-1 Boolean features, select the optimal split by information gain ratio.
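A sketch of that procedure, using the same toy (value, label) layout as before; the candidate thresholds are midpoints between adjacent distinct values, each scored as a Boolean "value <= threshold" feature by gain ratio:

```python
def best_threshold(pairs):
    """Pick the best split threshold for one continuous feature.

    `pairs` is a list of (numeric_value, label) tuples. Returns the
    (threshold, gain ratio) pair with the highest gain ratio, or None
    if the feature has only one distinct value.
    """
    values = sorted({value for value, _ in pairs})                   # k distinct values
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]   # k-1 midpoints
    best = None
    for t in candidates:
        boolean_pairs = [(value <= t, label) for value, label in pairs]
        score = gain_ratio(boolean_pairs)
        if best is None or score > best[1]:
            best = (t, score)
    return best

print(best_threshold([(45_000, "decline"), (65_000, "decline"),
                      (70_000, "decline"), (80_000, "accept")]))  # (75000.0, 1.0)
```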

However, C4.5 has a problem of its own. When the size of some |S_v| gets close to the size of |S|:

$$\mathrm{SplitInformation}(T) \to 0, \quad \mathrm{GainRatio}(T) \to \infty$$

To prevent this from letting an actually irrelevant feature occupy the root node, a heuristic can be used: first compute the information gain of every feature, and apply the gain ratio criterion only to those features whose information gain is above average.
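One way to express that heuristic in code, again reusing the helpers above and assuming a dict that maps each feature name to its (value, label) pairs:

```python
def choose_split_feature(feature_pairs):
    """Among features with at least average information gain,
    pick the one with the highest gain ratio."""
    gains = {name: information_gain(p) for name, p in feature_pairs.items()}
    average_gain = sum(gains.values()) / len(gains)
    shortlist = [name for name, g in gains.items() if g >= average_gain]
    return max(shortlist, key=lambda name: gain_ratio(feature_pairs[name]))
```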

C4.5's good performance, together with its relatively modest demands on computing power and data, makes it one of the most commonly used machine learning algorithms. In practice, its standing is even higher than ID3's.

CART

Both ID3 and C4.5 construct classification trees. There is another algorithm that is widely used for decision trees: the CART algorithm.

CART stands for Classification and Regression Tree. As the name makes immediately clear, it can be used not only for classification but also for regression.

The procedure of CART is roughly the same as that of ID3 and C4.5, with two differences:

  1. CART selects features not by information gain or gain ratio but by the Gini coefficient (Gini Coefficient). At each split, the feature with the smallest Gini coefficient is chosen as the optimal split point;
  2. CART is strictly a binary tree. Every split divides the data into exactly two parts.

One concept deserves special mention: the Gini coefficient (Gini Coefficient).

The Gini coefficient was originally a statistical concept, proposed in the early 20th century by the Italian scholar Corrado Gini as an indicator of how fair an income distribution is. The coefficient itself is a ratio whose value lies between 0 and 1.

When the Gini coefficient is used to judge a country's income, the smaller the value, the more evenly the annual income is distributed; the larger the value, the more concentrated it is. A Gini coefficient of 0 means that year's income is distributed equally among all citizens, while a Gini coefficient of 1 means all of that year's income is concentrated in the hands of a single person and the remaining citizens have no income at all.

Before the Gini coefficient, the US economist Max O. Lorenz proposed the concept of the "income distribution curve", also known as the Lorenz curve. The figure below shows a Lorenz curve:

[Figure: a Lorenz curve, with the cumulative share of the population on the horizontal axis and the cumulative share of income on the vertical axis]

In the figure, the horizontal axis is the cumulative percentage of the population, and the vertical axis is the percentage of total income earned by that portion of the population. The red line represents a perfectly equal income distribution, while the orange curve is the Lorenz curve, which shows the actual income distribution.

We can see that at 75% on the horizontal axis, the red line corresponds to 75% on the vertical axis, but the orange curve corresponds to less than 40%.

Let A be the area enclosed between the red line and the orange curve, and B the area under the orange curve. The Gini coefficient is the ratio A / (A + B). This concept is better known in economics than in machine learning.

In a decision tree, the Gini coefficient is calculated as:

$$\mathrm{Gini}(p) = \sum_{i=1}^{n} p_i (1 - p_i) = 1 - \sum_{i=1}^{n} p_i^2$$

For binary classification, if the probability that a sample belongs to the first class is p, then:

$$\mathrm{Gini}(p) = 2p(1 - p)$$

In that case, if p = 0.5 the Gini coefficient is 0.5, and if p = 0.9 the Gini coefficient is 0.18. Since 0.18 < 0.5, according to CART's principle the feature with p = 0.9 is more likely to be selected as the split feature.

Thus, for binary classification, the more unequal the probabilities of the two classes, the better the split point.

Although the example above is binary, the same trend holds for multi-class classification: the more unequal the probability distribution over the different classes under a feature, the more likely that feature is to become the split feature.
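A minimal sketch of the Gini calculation that reproduces the 0.5 and 0.18 values above (the label-list input format is my own choice):

```python
from collections import Counter

def gini(labels):
    """Gini(p) = 1 - sum(p_i^2) over the classes present in `labels`."""
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in Counter(labels).values())

print(gini(["a"] * 5 + ["b"] * 5))  # p = 0.5 -> 0.5
print(gini(["a"] * 9 + ["b"] * 1))  # p = 0.9 -> 0.18 (up to floating point)
```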

At this point some readers may come away with the impression that everything we have said about CART applies only to classification. In fact, whether it is doing classification or regression, the procedure is the same.

The difference between a classification tree and a regression tree lies in whether the final output value is discrete or continuous. Each feature, that is, the decision condition at a split point, is ultimately handled as a two-way split and converted into binary features, whether its own values are continuous or discrete:

  • If the feature used for the split is continuous, it is handled in a way similar to the C4.5 approach described above;
  • If the feature is discrete and takes k values in total, it is converted into k binary features, each of which splits the samples into "yes" and "no" according to whether that value is taken (a usage sketch follows this list).
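According to its documentation, scikit-learn's tree estimators use an optimized version of CART, so a quick way to see the classification and regression modes side by side is a sketch like the following; the tiny arrays are made up for illustration:

```python
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Features: [salary, commute hours, free coffee (0/1)]
X = [[65_000, 0.5, 0], [80_000, 0.4, 1], [45_000, 0.2, 1], [70_000, 1.5, 1]]

# Classification: Gini-based splits, a discrete class label as output.
y_class = ["decline", "accept", "decline", "decline"]
clf = DecisionTreeClassifier(criterion="gini").fit(X, y_class)
print(clf.predict([[75_000, 0.3, 1]]))

# Regression: the same tree-building procedure, a continuous value as output.
y_value = [0.2, 0.9, 0.1, 0.3]   # e.g. how attractive each offer is
reg = DecisionTreeRegressor().fit(X, y_value)
print(reg.predict([[75_000, 0.3, 1]]))
```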

Note: another term, the Gini index (Gini Index), often appears in the literature and is used in place of the Gini coefficient when describing the CART algorithm. The Gini index is simply the Gini coefficient multiplied by 100 and expressed as a percentage; the two are really the same thing.

Reproduced from: https://www.jianshu.com/p/a786c55597d2
