Two. "Machine learning" decision tree and decision tree ID3 algorithm Introduction

1. The decision tree is one of the most commonly used and most basic data mining algorithms, and its concept is very simple.
A decision tree classifies data through a series of rules; it provides rules of the form "under such-and-such conditions, you will get such-and-such a value."
2. Tree types: classification tree | regression tree
A classification tree is a decision tree built for discrete target variables.
A regression tree is a decision tree built for continuous target variables.
3. The decision tree learning process:
(1) feature selection
(2) decision tree generation
(3) pruning
That is the rough idea.
-----------------------------------
The ID3 decision tree algorithm
-----------------------------------
The ID3 algorithm is an early classification and prediction algorithm proposed by J. Ross Quinlan in 1975 at the University of Sydney; its core idea is "information entropy." The ID3 algorithm computes the information gain of each attribute, takes high information gain to indicate a good attribute, and at each split selects the attribute with the highest information gain as the splitting criterion, repeating this process until it generates a decision tree that classifies the training examples perfectly.
A decision tree classifies data in order to make predictions. The decision tree method builds a decision tree from an initial training data set; if the tree cannot classify all objects correctly, some of the exceptions are added to the training data and the process is repeated, until a correct decision set is formed. A decision tree represents this decision set as a tree structure.
A decision tree consists of nodes, branches, and leaves. The top of the tree is the root node; each branch leads to a new decision node or to a leaf of the tree. Each decision node represents a question or decision, generally corresponding to an attribute of the object being classified. Each leaf node represents a possible classification result. Traversing the tree from top to bottom, each node poses a test, and the different answers to that node's question lead down different branches, until finally a leaf node is reached. This process of using multiple variables to determine the category is exactly how a decision tree performs classification.
The information-theoretic basis of the ID3 algorithm:
(1) First, we need to understand entropy, a concept from probability theory; entropy can be interpreted as a measure of the uncertainty of a piece of information.
If the objects to be classified fall into n categories x1, x2, ......, xn, occurring with probabilities P1, P2, ......, Pn respectively, then the entropy of X is defined as:
H(X) = -Σ_{i=1}^{n} P_i * log2(P_i)
The smaller the value of H(X), the higher the purity of X (and the simpler the tree).
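As a minimal sketch of the definition above, entropy can be computed from the class proportions of a label list (function and variable names are my own, not from the original):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """H(X) = -sum of P_i * log2(P_i) over the class proportions P_i."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# An even 50/50 split has maximal uncertainty; a pure set has entropy 0.
print(entropy(["yes", "no", "yes", "no"]))  # prints 1.0
```

A pure list such as ["yes", "yes", "yes"] gives entropy 0, the highest possible purity.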
(2) Conditional entropy
Suppose the random variables (X, Y) have the joint probability distribution P(X = xi, Y = yj) = pij, i = 1, 2, ⋯, n; j = 1, 2, ⋯, m.

The conditional entropy H(Y|X) represents the remaining uncertainty of the random variable Y when the random variable X is known. It is defined as the mathematical expectation, taken with respect to the probability distribution of X, of the entropy of Y under each given value of X:
H(Y|X) = Σ_{i=1}^{n} P(X = x_i) * H(Y | X = x_i)
(3) Information gain
Information gain measures how much a feature X reduces the uncertainty about the class Y (the greater the information gain, the higher the resulting purity). It is defined as:

g(Y, X) = H(Y) - H(Y|X)
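Under the definitions above, information gain can be sketched by grouping the labels by each value of the feature (the toy data and names below are invented for illustration):

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """g(Y, X) = H(Y) - H(Y|X): how much knowing X reduces uncertainty in Y."""
    groups = defaultdict(list)
    for x, y in zip(feature, labels):
        groups[x].append(y)
    n = len(labels)
    cond = sum(len(ys) / n * entropy(ys) for ys in groups.values())  # H(Y|X)
    return entropy(labels) - cond

# Hypothetical toy data: "windy" perfectly predicts "play",
# so its gain equals the full entropy H(play) = 1.0.
play  = ["yes", "yes", "no", "no"]
windy = ["no",  "no",  "yes", "yes"]
print(information_gain(windy, play))  # prints 1.0
```

A feature that tells us nothing (each of its groups keeps the same class mix) would get a gain of 0, which is why ID3 picks the attribute with the highest gain at each split.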
The ID3 algorithm has a preference for attributes with a larger number of possible values. To reduce the impact this preference may bring, we can use the C4.5 decision tree algorithm; I will summarize C4.5 in the next chapter.
As for the pseudocode, I will simply show the figure from the watermelon book (Zhou Zhihua's "Machine Learning"):
[Figure: decision tree generation pseudocode from the watermelon book]
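The generation procedure that pseudocode describes can be sketched as a short recursive implementation. This is a simplified sketch under my own assumptions, not the book's exact pseudocode; the data and attribute names are invented:

```python
from collections import Counter, defaultdict
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[attr]].append(y)
    n = len(labels)
    return entropy(labels) - sum(len(ys) / n * entropy(ys) for ys in groups.values())

def id3(rows, labels, attrs):
    """Build a tree as nested dicts {attr: {value: subtree}}; leaves are class labels."""
    if len(set(labels)) == 1:              # all examples in one class -> leaf
        return labels[0]
    if not attrs:                          # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        sub_rows = [rows[i] for i in idx]
        sub_labels = [labels[i] for i in idx]
        tree[best][value] = id3(sub_rows, sub_labels, [a for a in attrs if a != best])
    return tree

# Hypothetical toy data (attribute names invented for illustration):
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rain",  "windy": "no"},
    {"outlook": "rain",  "windy": "yes"},
]
labels = ["yes", "no", "yes", "no"]
tree = id3(rows, labels, ["outlook", "windy"])
print(tree)  # "windy" has the highest gain, so it becomes the root split
```

Note this sketch omits the pruning step from the learning process above; it only covers feature selection and tree generation.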

Origin blog.csdn.net/ssssdww/article/details/103409987