Decision trees in sklearn

'''
Decision tree:
Basic principle of the algorithm:
Core idea: similar inputs produce similar outputs. Example: predicting someone's salary.
Age: 1 - young, 2 - middle-aged, 3 - elderly
Education: 1 - bachelor, 2 - master, 3 - doctor
Experience: 1 - novice, 2 - ordinary, 3 - veteran, 4 - expert
Gender: 1 - male, 2 - female

| Age | Education | Experience | Gender | ==> | Salary       |
| --- | --------- | ---------- | ------ | --- | ------------ |
| 1   | 1         | 1          | 1      | ==> | 6000 (low)   |
| 2   | 1         | 3          | 1      | ==> | 10000 (mid)  |
| 3   | 3         | 4          | 1      | ==> | 50000 (high) |
| ... | ...       | ...        | ...    | ==> | ...          |
| 1   | 3         | 2          | 2      | ==> | ?            |

To improve search efficiency, a tree data structure is used to organize the sample data: first split into sub-trees by age, then split each of those further by education, and so on until every feature has been used.
This produces leaf-level sub-tables in which all samples have exactly the same feature values.

A first feature is chosen to split the training sample matrix into sub-tables, so that within each sub-table the value of that feature is identical.
Then the next feature is chosen and each sub-table is split into smaller sub-tables by the same rule; this is repeated until all features have been used up.
The result is leaf-level sub-tables in which all samples share exactly the same feature values. For a sample to be predicted, the matching sub-table is selected according to each of its feature values,
level by level, until the leaf sub-table that matches exactly is found; the average (regression) or the vote (classification) of the samples in that sub-table is returned as the prediction.
As the sub-tables are split, the information entropy (the degree of disorder in the information) keeps decreasing: the information becomes purer and the data more ordered.
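
A minimal sketch of this idea (not from the original post; the three rows below are just the example rows from the salary table above): a regression tree groups samples with similar feature values into the same leaf and answers a query with the average of that leaf.

import numpy as np
import sklearn.tree as st

# columns: age, education, experience, gender (encoded as listed above)
toy_x = np.array([[1, 1, 1, 1],
                  [2, 1, 3, 1],
                  [3, 3, 4, 1]])
toy_y = np.array([6000, 10000, 50000])
toy_model = st.DecisionTreeRegressor(max_depth=2)
toy_model.fit(toy_x, toy_y)
print(toy_model.predict([[1, 3, 2, 2]]))  # the row marked "?" in the table above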

Decision tree regression related API:
import sklearn.tree as st
# create a decision tree regression model with a maximum depth of 4
model = st.DecisionTreeRegressor(max_depth=4)
# train the model
# train_x: two-dimensional array of sample data
# train_y: the target value for each row of the training set
model.fit(train_x, train_y)
# test the model
pred_test_y = model.predict(test_x)

Decision tree model optimization:
1. Engineering optimization: do not use all of the features; allow leaf sub-tables to mix different feature values. This reduces the number of levels of the tree and,
at an acceptable cost in accuracy, improves the performance of the model. In general, the feature whose split reduces the entropy the most is preferred as the basis for dividing the sub-tables. (A small sketch comparing tree depths follows the boosting API below.)
2. Ensemble algorithms: combine the predictions of several different models, using the average (regression) or a vote (classification), to obtain the final prediction. Ensemble algorithms based on decision trees
build, according to certain rules, several decision tree models that differ from each other, let each of them predict the unknown sample,
and finally average or vote to reach a more comprehensive conclusion. --- one tree is one-sided; many trees together make the model generalize.
1> Boosting ("positive incentive"): first, random initial weights are assigned to the samples of the sample matrix, and a decision tree is built with those weights;
when this tree produces a prediction, the output is obtained by a weighted average or a weighted vote.
The training samples are then fed back into the model; for the samples whose predicted value differs from the actual value, the weight is increased,
and a second decision tree is built with the new weights. Repeating this process builds several decision trees with different weights. --- one tree is one-sided; many trees together make the model generalize.
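
A rough sketch of that reweighting idea (not the exact AdaBoost.R2 update that sklearn implements, and assuming the train_x / train_y split built in the case code below):

import numpy as np
import sklearn.tree as st

weights = np.full(len(train_y), 1 / len(train_y))      # equal initial weights
tree1 = st.DecisionTreeRegressor(max_depth=4)
tree1.fit(train_x, train_y, sample_weight=weights)
errors = np.abs(train_y - tree1.predict(train_x))
weights *= np.where(errors > errors.mean(), 2.0, 1.0)  # boost the poorly predicted samples
weights /= weights.sum()
tree2 = st.DecisionTreeRegressor(max_depth=4)          # a second tree trained with the new weights
tree2.fit(train_x, train_y, sample_weight=weights)
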
Boosting related API:
import sklearn.tree as st
import sklearn.ensemble as se  # ensemble algorithm module
# model: a single decision tree model
model = st.DecisionTreeRegressor(max_depth=4)
# adaptive boosting (AdaBoost) decision tree regression model
# n_estimators: build 400 trees with different weights (i.e. how many trees to train)
model = se.AdaBoostRegressor(model, n_estimators=400, random_state=7)
# train the model
model.fit(train_x, train_y)
# test the model
pred_test_y = model.predict(test_x)
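
A minimal sketch of the depth/accuracy trade-off from optimization point 1 above (assuming the train_x / train_y / test_x / test_y split built in the case code below): compare a few max_depth values and keep the shallowest tree whose score is acceptable.

import sklearn.tree as st
import sklearn.metrics as sm

for depth in (2, 4, 6, 8):
    m = st.DecisionTreeRegressor(max_depth=depth)
    m.fit(train_x, train_y)
    print(depth, sm.r2_score(test_y, m.predict(test_x)))  # r2 score for each depth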


Case: predict home prices in the Boston area.
Steps:
1. Read the data, shuffle the original data set, and split it into a training set and a test set.
2. Build the models, train them, and evaluate them on the test set with the r2 score.
'''
import numpy as np
import matplotlib.pyplot as mp
import sklearn.tree as st
import sklearn.datasets as sd  # data sets provided by sklearn
import sklearn.utils as su     # utilities, e.g. shuffling a data set by row
import sklearn.metrics as sm
import sklearn.ensemble as se

# load the Boston house price data set
boston = sd.load_boston()
# ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT']
# ['crime rate', 'residential land proportion', 'commercial land proportion', 'bordering the river', ...]
print(boston.feature_names)
print(boston.data.shape)    # input data
print(boston.target.shape)  # output data

# split into training set and test set (80/20): 80% for training, 20% for testing
# random_state is the random seed; shuffling with the same seed gives the same result
x, y = su.shuffle(boston.data, boston.target, random_state=7)  # shuffle the data set by row
train_size = int(len(x) * 0.8)
train_x, test_x, train_y, test_y = x[:train_size], x[train_size:], y[:train_size], y[train_size:]
print(train_x.shape)
print(test_x.shape)

# build -> train -> test the model ----- a single decision tree
model = st.DecisionTreeRegressor(max_depth=4)
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)

# model evaluation --- single decision tree
print(sm.r2_score(test_y, pred_test_y))
print('=======================')

# predict house prices with boosting (AdaBoost)
model = se.AdaBoostRegressor(model, n_estimators=400, random_state=7)
model.fit(train_x, train_y)
pred_test_y = model.predict(test_x)
# score of the boosted model
print(sm.r2_score(test_y, pred_test_y))


Output:
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO'
 'B' 'LSTAT']
(506, 13)
(506,)
(404, 13)
(102, 13)
0.8202560889408635
=======================
0.9068598725149652


Origin www.cnblogs.com/yuxiangyang/p/11184003.html