Introduction to the decision tree model

1. Decision trees

A decision tree is a tree structure in which each leaf node corresponds to a classification decision and each non-leaf node corresponds to a split on some attribute; according to the samples' different values of that attribute, the sample set is divided into several subsets.

The core problem of constructing a decision tree is how to choose the appropriate attribute on which to split the samples at each step.

Decision tree process: for a classification problem, a decision tree is learned and constructed from training samples with known class labels, proceeding from top to bottom in a divide-and-conquer fashion.

2. Commonly used decision tree algorithms

Decision tree algorithms and their descriptions:

ID3 algorithm: Its core is to use information gain as the criterion for attribute selection at every level of the decision tree, which helps determine the appropriate attribute to use when generating each node.

C4.5 algorithm: The C4.5 decision tree generation algorithm is an important improvement over ID3: it uses the information gain ratio to select node attributes. C4.5 overcomes a shortcoming of ID3: ID3 can only handle discrete descriptive attributes, whereas C4.5 can handle both discrete and continuous descriptive attributes.

CART algorithm: CART is a non-parametric classification and regression method. It constructs a binary tree by growing the tree, pruning the tree and evaluating the tree. When the target variable is continuous, the tree is a regression tree; when the target variable is categorical, the tree is a classification tree. (A brief scikit-learn sketch follows this table.)
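For orientation, the tree implementation in scikit-learn (used in the code later in this post) is an optimised CART-style algorithm, so it offers both a classifier and a regressor. Below is a minimal sketch with made-up toy data, only to illustrate the classification-tree versus regression-tree distinction described above:

from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor

# Classification tree: the target is a categorical variable
clf = DecisionTreeClassifier(criterion='entropy')
clf.fit([[0, 0], [0, 1], [1, 0], [1, 1]], ['low', 'low', 'high', 'high'])
print(clf.predict([[1, 1]]))  # expected: ['high']

# Regression tree: the target is a continuous variable
reg = DecisionTreeRegressor(max_depth=2)
reg.fit([[0], [1], [2], [3]], [0.0, 0.9, 2.1, 3.0])
print(reg.predict([[2.5]]))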

3. ID3 algorithm

ID3 algorithm principle

The ID3 algorithm chooses the best test attribute based on information entropy.

It selects, as the test attribute, the attribute with the maximum information gain on the currently selected sample set.

The sample set is divided according to the values of the test attribute: the sample set is split into as many sub-sample sets as the test attribute has distinct values, and at the same time the decision tree node corresponding to the sample set grows a new leaf node for each of these subsets.

Following information theory, the ID3 algorithm uses the uncertainty of the sample set after partitioning as the measure of how good a partition is, and quantifies this uncertainty with the information gain: the larger the information gain, the smaller the uncertainty.

Therefore, at each non-leaf node the ID3 algorithm selects the attribute with the maximum information gain as the test attribute, so that the purest possible partition is obtained in the current situation, resulting in a smaller tree.

Let S be a set of s data samples. Suppose the class attribute has m distinct values, defining the classes C_i (i = 1, 2, ..., m), and let s_i be the number of samples in class C_i. The total entropy of the given sample set is:

I(s_1, s_2, ..., s_m) = -Σ_{i=1}^{m} p_i log2(p_i)

where p_i is the probability that an arbitrary sample belongs to class C_i, which can generally be estimated as s_i / s.
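As a concrete illustration of this formula, here is a small Python sketch (the helper name entropy is ours, not from the original text) that computes the total entropy of a list of class labels:

import math
from collections import Counter

def entropy(labels):
    """Total entropy I(s_1, ..., s_m) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)  # s_i: number of samples in each class C_i
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Example: 9 "high" and 5 "low" sales records
print(entropy(['high'] * 9 + ['low'] * 5))  # about 0.940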

Suppose attribute A has k distinct values {a_1, a_2, ..., a_k}. Attribute A partitions the set S into the subsets {S_1, S_2, ..., S_k}, where S_j contains those samples of S whose value of attribute A is a_j. If A is selected as the test attribute, these subsets correspond to the new leaf nodes grown from the node holding the sample set S.

Let s_ij be the number of samples of class C_i in the subset S_j. The entropy of the partition of the samples by attribute A is:

E(A) = Σ_{j=1}^{k} (s_1j + s_2j + ... + s_mj) / s · I(s_1j, s_2j, ..., s_mj)

where I(s_1j, s_2j, ..., s_mj) = -Σ_{i=1}^{m} p_ij log2(p_ij), and p_ij = s_ij / (s_1j + s_2j + ... + s_mj) is the probability that a sample in the subset S_j belongs to class C_i. Finally, the information gain obtained by splitting the sample set S on attribute A is:

Gain(A) = I(s_1, s_2, ..., s_m) - E(A)

The smaller E(A) is, the larger Gain(A) is, which means that choosing attribute A as the test attribute provides more information for the classification and leaves less uncertainty after the split. The k distinct values of attribute A divide the sample set S into k subsets, i.e. k branches; by recursively applying the process above to each subset, further attributes are selected as branch nodes and child nodes are generated until the whole decision tree is built. As a typical decision tree learning algorithm, ID3 takes information gain as the criterion for choosing the attribute at every level of the tree, so that the test performed at each non-leaf node obtains the maximum information gain, i.e. the minimum entropy of the partitioned data set. This keeps the average depth of the tree small and thus effectively improves the classification efficiency.
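To make E(A) and Gain(A) concrete, the following sketch (again with illustrative names, reusing the entropy helper defined above) computes the information gain of one attribute from parallel lists of attribute values and class labels:

def conditional_entropy(attr_values, labels):
    """E(A): entropy of the partition induced by the attribute, weighted by subset size."""
    total = len(labels)
    return sum(
        attr_values.count(v) / total
        * entropy([c for a, c in zip(attr_values, labels) if a == v])
        for v in set(attr_values)
    )

def info_gain(attr_values, labels):
    """Gain(A) = I(s_1, ..., s_m) - E(A)."""
    return entropy(labels) - conditional_entropy(attr_values, labels)

# Example: how much does "weekend" tell us about "sales"?
weekend = ['yes', 'yes', 'yes', 'no', 'no', 'no']
sales = ['high', 'high', 'high', 'high', 'low', 'low']
print(info_gain(weekend, sales))  # about 0.459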

ID3 algorithm flow

(1) For the current sample set, calculate the information gain of every attribute;

(2) Select the attribute with the maximum information gain as the test attribute, and group samples with the same value of the test attribute into the same sub-sample set;

(3) If a sub-sample set contains only a single value of the class attribute, the branch becomes a leaf node; determine its class value, mark it with the corresponding label and return; otherwise, apply this algorithm recursively to the sub-sample set (a toy recursive sketch of these steps is given below).
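Putting the three steps together, a toy recursive implementation might look like the following sketch (function and field names such as id3 and label_key are assumptions for illustration, and it reuses the info_gain helper sketched earlier); each sample is a dict of attribute values plus a class label:

def id3(samples, attributes, label_key='class'):
    """Build a nested-dict decision tree from a list of sample dicts."""
    labels = [s[label_key] for s in samples]
    # Step (3): all samples share one class value (or no attribute is left) -> leaf node
    if len(set(labels)) == 1 or not attributes:
        return max(set(labels), key=labels.count)
    # Steps (1) and (2): pick the attribute with the largest information gain
    best = max(attributes, key=lambda a: info_gain([s[a] for s in samples], labels))
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]
    for v in set(s[best] for s in samples):
        subset = [s for s in samples if s[best] == v]
        tree[best][v] = id3(subset, remaining, label_key)  # recurse on the sub-sample set
    return tree

# Example call (attribute names are hypothetical):
# id3(samples, ['weather', 'weekend', 'promotion'], label_key='sales')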

# -*- coding: utf-8 -*-
# Use the ID3 algorithm to predict the sales level
import pandas as pd

# Initialisation
inputfile = '../data/sales_data.xls'
data = pd.read_excel(inputfile, index_col=u'ID')  # import the data

# The class labels in the data are text; convert them to numbers:
# 1 stands for "good", "yes" and "high"; -1 stands for "bad", "no" and "low"
data[data == u'good'] = 1
data[data == u'yes'] = 1
data[data == u'high'] = 1
data[data != 1] = -1
x = data.iloc[:, :3].as_matrix().astype(int)
y = data.iloc[:, 3].as_matrix().astype(int)

from sklearn.tree import DecisionTreeClassifier as DTC
dtc = DTC(criterion='entropy')  # decision tree model based on information entropy
dtc.fit(x, y)  # train the model

# Import the relevant function to visualise the tree.
# The exported result is a dot file; Graphviz is needed to convert it to a format such as pdf or png.
from sklearn.tree import export_graphviz
x = pd.DataFrame(x)
with open("tree.dot", 'w') as f:
    f = export_graphviz(dtc, feature_names=x.columns, out_file=f)
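Note: the snippet above keeps the older pandas/scikit-learn style of the original post. With recent library versions a few adjustments are likely needed: as_matrix() has been removed from pandas (use .values instead), and export_graphviz can write the dot file directly via out_file="tree.dot" without an open file handle. Once tree.dot exists, Graphviz converts it with a command such as dot -Tpng tree.dot -o tree.png.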


Source: www.cnblogs.com/Iceredtea/p/12056794.html