Artificial Intelligence for Everyone - Decision Trees

In the article on logistic regression we were introduced to classification tasks. Today we introduce the decision tree, an algorithm that can be used for both classification and regression; here we focus mainly on decision trees for classification.

Getting to know decision trees

As the name suggests, a decision tree model has a tree structure. In classification problems, it classifies instances based on their features: you can think of it as a set of if-else rules, where an instance is classified by testing its features against these rules one by one.
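As a rough illustration (a made-up rule set, not a tree learned from real data), a small decision tree is equivalent to nested if-else tests on the features:

def classify(petal_length, petal_width):
    """Toy decision tree written as nested if-else rules.

    The thresholds here are invented for illustration; a real tree
    learns them from data.
    """
    if petal_length < 2.5:          # internal node: test a feature
        return "setosa"             # leaf node: assign a class
    else:
        if petal_width < 1.75:      # another internal node
            return "versicolor"
        else:
            return "virginica"

print(classify(1.4, 0.2))  # -> setosa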

Tree structure

A decision tree is a tree structure consisting of nodes and directed edges. Nodes are divided into internal nodes and leaf nodes: an internal node tests a feature (attribute), and a leaf node corresponds to a class.

[Figure: example of a decision tree structure]

As the diagram above shows, each internal node tests an attribute and splits the samples into subsets accordingly. The question, then, is which attribute to test at each node so that we obtain a good tree quickly.

We want the samples contained in each node to belong to the same class as much as possible, i.e., we want the "purity" of the nodes to become higher and higher as we go down the tree.

Information gain

We use information entropy to measure the purity of a sample set. Let p_k (k = 1, 2, ..., |\mathcal{Y}|) be the proportion of samples of the k-th class in the sample set D, where |\mathcal{Y}| is the number of classes. The entropy of D is then defined as:
Ent(D) = -\sum_{k=1}^{|\mathcal{Y}|} p_k \log_2 p_k
The smaller Ent(D) is, the higher the purity of D.
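As a quick sanity check, entropy is easy to compute directly. Below is a minimal NumPy sketch (my own helper, not part of the original post) that computes Ent(D) from an array of class labels:

import numpy as np

def entropy(labels):
    """Ent(D) = -sum_k p_k * log2(p_k), computed from class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([0, 0, 0, 0]))  # 0.0 (a pure set; may print as -0.0)
print(entropy([0, 0, 1, 1]))  # 1.0 (a 50/50 split of two classes)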

The information gain obtained by splitting the sample set D on attribute a is:
Gain(D, a) = Ent(D) - \sum_{v=1}^V \frac{|D^v|}{|D|} Ent(D^v)
where V is the number of possible values of attribute a, and D^v is the subset of samples in D for which attribute a takes its v-th value.
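Building on the entropy helper above, information gain for a discrete attribute can be sketched as follows (again an illustrative helper; it assumes the attribute values are given as an array aligned with the labels):

import numpy as np

def information_gain(labels, attribute_values):
    """Gain(D, a) = Ent(D) - sum_v |D^v|/|D| * Ent(D^v)."""
    labels = np.asarray(labels)
    attribute_values = np.asarray(attribute_values)
    weighted_entropy = 0.0
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]              # D^v
        weighted_entropy += len(subset) / len(labels) * entropy(subset)
    return entropy(labels) - weighted_entropy

# An attribute that separates the labels perfectly gains the full Ent(D).
print(information_gain([0, 0, 1, 1], ['a', 'a', 'b', 'b']))  # 1.0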

The larger the information gain, the greater the improvement in purity obtained by splitting on attribute a. This criterion has a flaw, though: attributes with many possible values, such as an ID or a date, produce splits of very high purity, yet such attributes are clearly unsuitable for splitting on. In fact, information gain is biased towards attributes with many possible values, so instead of using it directly we use the gain ratio.

Gain ratio

The gain ratio is defined as:
Gain\_ratio(D, a) = \frac{Gain(D, a)}{IV(a)}

IV(a) = -\sum_{v=1}^V \frac{|D^v|}{|D|} \log_2 \frac{|D^v|}{|D|}

IV(a) is called the intrinsic value of attribute a: the more possible values attribute a has, the larger IV(a) tends to be.

However, the gain ratio is in turn biased towards attributes with fewer possible values, so the C4.5 algorithm does not simply select the attribute with the largest gain ratio. Instead it uses a heuristic: first find the attributes whose information gain is above average, and then, among those, select the one with the highest gain ratio.
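Continuing the sketch (these are simplified helpers for illustration, not C4.5 itself, and they reuse the information_gain function above), the intrinsic value, the gain ratio, and the above-average heuristic might look like this:

import numpy as np

def intrinsic_value(attribute_values):
    """IV(a) = -sum_v |D^v|/|D| * log2(|D^v|/|D|)."""
    _, counts = np.unique(attribute_values, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gain_ratio(labels, attribute_values):
    # Assumes the attribute has at least two distinct values, so IV(a) > 0.
    return information_gain(labels, attribute_values) / intrinsic_value(attribute_values)

def c45_choose_attribute(labels, attributes):
    """attributes: dict mapping attribute name -> array of attribute values.

    C4.5-style heuristic: keep only attributes whose information gain is
    above average, then pick the one with the highest gain ratio.
    """
    gains = {a: information_gain(labels, values) for a, values in attributes.items()}
    average_gain = sum(gains.values()) / len(gains)
    candidates = [a for a, g in gains.items() if g >= average_gain]
    return max(candidates, key=lambda a: gain_ratio(labels, attributes[a]))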

Gini index

The CART algorithm uses the Gini index to select the splitting attribute. The Gini value measures the purity of the data set D:
Gini(D) = \sum_{k=1}^{|\mathcal{Y}|} \sum_{k' \neq k} p_k p_{k'} = 1 - \sum_{k=1}^{|\mathcal{Y}|} p_k^2

Gini(D) reflects the probability that two samples drawn at random from the data set belong to different classes, so the smaller Gini(D) is, the higher the purity of the data set D. The Gini index of attribute a is then:
Gini\_index(D, a) = \sum_{v=1}^V \frac{|D^v|}{|D|} Gini(D^v)
Therefore we choose the attribute with the smallest Gini index to split on.
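The same kind of sketch works for the Gini value and the Gini index (again illustrative helpers, not CART itself):

import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def gini_index(labels, attribute_values):
    """Gini_index(D, a) = sum_v |D^v|/|D| * Gini(D^v)."""
    labels = np.asarray(labels)
    attribute_values = np.asarray(attribute_values)
    total = 0.0
    for v in np.unique(attribute_values):
        subset = labels[attribute_values == v]      # D^v
        total += len(subset) / len(labels) * gini(subset)
    return total

# Given attributes as a dict of name -> value array, pick the smallest index:
# best = min(attributes, key=lambda a: gini_index(labels, attributes[a]))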

Tree pruning

A decision tree generated this way tends to classify the training data very accurately but to be less accurate on unseen data, i.e., it overfits. During training the algorithm focuses too much on classifying the training data correctly, so it treats peculiarities of the training data as properties common to all data and learns them, producing an overly complex tree. Pruning is therefore used to simplify the structure of the tree.

Pruning is divided into pre-pruning and post-pruning.

Pre-pruning estimates, before each split during tree construction, whether the split would improve generalization performance; if not, the node is not split. However, for some nodes a split may not improve generalization immediately, yet subsequent splits based on it may, so the greedy nature of pre-pruning brings a risk of underfitting.

Post-pruning first trains a complete tree and then inspects its internal nodes from the bottom up: if replacing a subtree with a leaf node improves generalization performance, the subtree is replaced by that leaf. Post-pruning carries a smaller risk of underfitting, but because it prunes only after a complete tree has been generated, its time cost is much larger than that of pre-pruning.
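In scikit-learn you normally don't implement pruning by hand. Pre-pruning corresponds to constraints such as max_depth or min_samples_leaf, and post-pruning is available as minimal cost-complexity pruning through the ccp_alpha parameter (scikit-learn 0.22 or later). A rough sketch:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, random_state=0)

# Pre-pruning: constrain the tree while it is being grown.
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5)
pre_pruned.fit(X_train, y_train)

# Post-pruning: compute the cost-complexity pruning path, then refit with a
# chosen ccp_alpha (in practice, pick alpha by cross-validation).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
post_pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2])
post_pruned.fit(X_train, y_train)

print(pre_pruned.score(X_test, y_test))
print(post_pruned.score(X_test, y_test))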

Putting it into practice

Beginners who see this many formulas should not be alarmed; scikit-learn has already implemented all of this for us, so let's put it to use. The data we use here is the well-known iris dataset. It contains four features (sepal length, sepal width, petal length, petal width) and three classes (setosa, versicolor, virginica), with 150 samples in total.
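Assuming scikit-learn is installed, you can load the dataset and confirm these numbers quickly:

from sklearn.datasets import load_iris

iris = load_iris()
print(iris.data.shape)      # (150, 4): 150 samples, 4 features
print(iris.feature_names)   # sepal/petal length and width
print(iris.target_names)    # ['setosa' 'versicolor' 'virginica']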

To make visualization easier, we first train a decision tree on each pair of two features and look at the resulting decision boundaries:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Dataset and plotting parameters used below
iris = load_iris()
n_classes = 3
plot_colors = "ryb"
plot_step = 0.02

for pairidx, pair in enumerate([[0, 1], [0, 2], [0, 3],
                                [1, 2], [1, 3], [2, 3]]):
    # We only take the two corresponding features
    X = iris.data[:, pair]
    y = iris.target

    # Train
    clf = DecisionTreeClassifier().fit(X, y)

    # Plot the decision boundary
    plt.subplot(2, 3, pairidx + 1)

    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, plot_step),
                         np.arange(y_min, y_max, plot_step))
    plt.tight_layout(h_pad=0.5, w_pad=0.5, pad=2.5)

    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    cs = plt.contourf(xx, yy, Z, cmap=plt.cm.RdYlBu)

    plt.xlabel(iris.feature_names[pair[0]])
    plt.ylabel(iris.feature_names[pair[1]])

    # Plot the training points
    for i, color in zip(range(n_classes), plot_colors):
        idx = np.where(y == i)
        plt.scatter(X[idx, 0], X[idx, 1], c=color, label=iris.target_names[i],
                    cmap=plt.cm.RdYlBu, edgecolor='black', s=15)

plt.legend(loc='lower right')
plt.show()

[Figure: decision boundaries of trees trained on each pair of iris features]

Next, we train a tree using all the features and draw the tree structure:

from sklearn.tree import plot_tree

clf = DecisionTreeClassifier().fit(iris.data, iris.target)
plot_tree(clf, filled=True)
plt.show()

[Figure: tree structure drawn with plot_tree]

If you look closely at the tree above, the features are represented by their indices and the classes by an array of sample counts, which is inconvenient to read. Rendering the tree with graphviz, as below, is much friendlier:

import graphviz
from sklearn.tree import export_graphviz

dot_data = export_graphviz(clf, out_file=None, feature_names=iris.feature_names, class_names=iris.target_names,
                           filled=True, rounded=True, special_characters=True)
graph = graphviz.Source(dot_data)
graph.render('iris')

[Figure: tree structure rendered with graphviz]

Scan the QR code to follow the WeChat official account "Machine Craftsman" and reply with the keyword "decision tree" to get the implementation code.

[Image: WeChat official account QR code]


Original article: blog.csdn.net/LXYTSOS/article/details/94332355