Decision tree [Brief summary of machine learning notes]

A decision tree is a non-parametric supervised learning method that derives decision rules from data with features and labels and presents those rules in a tree structure to solve classification and regression problems. Decision tree algorithms are easy to understand, applicable to many kinds of data, and perform well on a wide range of problems. In particular, ensemble algorithms built around tree models are widely used across many industries and fields.

Splitting criteria

Information gain

In 1948, Shannon proposed the concept of information entropy (entropy).
If event A can be partitioned into outcomes (A1, A2, ..., An) with probabilities (p1, p2, ..., pn), the information entropy is defined as:

$H(A) = -\sum_{i=1}^{n} p_i \log_2 p_i$

(the logarithm is base 2; base-10 logarithms are written lg).

1. Entropy
Entropy measures the degree of disorder of a system: the more ordered the system, the lower the entropy; the more chaotic or dispersed the system, the higher the entropy.
2. Information entropy
● From the completeness of the information: when the degree of order of the system is the same, the more concentrated the data, the smaller the entropy; the more dispersed the data, the larger the entropy.
● From the orderliness of the information: when the amount of data is the same, the more ordered the system, the lower the entropy; the more chaotic or dispersed the system, the higher the entropy.
Information gain: the difference in entropy before and after splitting the data set on a given feature. Entropy represents the uncertainty of a sample set: the larger the entropy, the greater the uncertainty. Therefore, the difference between the entropy before and after the split can be used to measure how effective the current feature is at partitioning sample set D.
Information gain = entropy(before) - entropy(after)
The information gain g(D, A) of feature A on training set D is defined as the difference between the information entropy H(D) of set D and the conditional entropy H(D|A) of D given feature A:

$g(D,A) = H(D) - H(D|A)$

where the conditional entropy expands as $H(D|A) = \sum_{v=1}^{V} \frac{|D_v|}{|D|} H(D_v)$, with $D_v$ the subset of D on which feature A takes its v-th value.
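To make these definitions concrete, here is a small Python sketch (not from the original notes) that computes entropy and information gain from arrays of labels and feature values:

import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, feature_values):
    """g(D, A) = H(D) - H(D|A) for one discrete feature A."""
    labels = np.asarray(labels)
    feature_values = np.asarray(feature_values)
    h_before = entropy(labels)
    h_after = 0.0
    for v in np.unique(feature_values):
        subset = labels[feature_values == v]
        h_after += len(subset) / len(labels) * entropy(subset)
    return h_before - h_after

# Toy usage: does "weather" help predict "play"?
play    = ["yes", "yes", "no", "no", "yes", "no"]
weather = ["sun", "sun", "rain", "rain", "sun", "rain"]
print(information_gain(play, weather))  # equals the full entropy here: weather perfectly separates the labels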
Case:
As shown in the table on the left below, the first column is the user number (forum ID), the second column is gender, the third column is activity level, and the last column is whether the user churned.
We need to answer one question: which of the two features, gender and activity, has a greater impact on user churn?
[table image]
This can be answered by computing the information gain. In the statistics table on the right, Positive denotes the positive samples (churned), Negative the negative samples (not churned), and the values below are the counts of users under each split.
Three entropies can be obtained:
Overall entropy:
[formula image]
Gender entropy:
[formula image]
Gender information gain:
[formula image]
Activity entropy:
[formula image]
Activity information gain:
[formula image]
The information gain of activity is greater than the information gain of gender; in other words, activity has a greater impact on user churn than gender does.

When doing feature selection or data analysis, we should therefore pay particular attention to the activity feature.
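A minimal sketch of how this comparison could be reproduced in Python. The counts below are hypothetical placeholders, since the original table is only available as an image; substitute the real numbers from the table:

import numpy as np

def entropy_from_counts(counts):
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(groups, overall):
    total = sum(sum(c) for c in groups.values())
    h_after = sum(sum(c) / total * entropy_from_counts(c) for c in groups.values())
    return entropy_from_counts(overall) - h_after

# Hypothetical (churned, not churned) counts -- replace with the numbers from the table above.
overall  = [5, 10]
gender   = {"male": [3, 5], "female": [2, 5]}
activity = {"high": [0, 6], "medium": [1, 4], "low": [4, 0]}

print(information_gain(gender, overall))    # small gain
print(information_gain(activity, overall))  # larger gain: activity matters more for churn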

Information gain ratio

**Gain ratio:** the gain ratio is defined as the ratio of the information gain Gain(S, A) to the split information SplitInformation(S, A) of the feature used for the split (such as gender or activity in the example above):

$GainRatio(S,A) = \dfrac{Gain(S,A)}{SplitInformation(S,A)}, \quad SplitInformation(S,A) = -\sum_{i=1}^{c} \frac{|S_i|}{|S|}\log_2\frac{|S_i|}{|S|}$

where the $S_i$ are the subsets of S induced by the c values of feature A.
Case:
As shown in the table below: under what weather conditions is golf played?
[table image]
Overall information entropy: $H(Y) = -\frac{5}{14}\log_2\frac{5}{14} - \frac{9}{14}\log_2\frac{9}{14} = 0.9403$

For any new weather record, whether golf is played is modeled as a random variable Y.
For any new weather record, the weather (Outlook) being sunny, overcast, or rainy is modeled as a random variable X.
If the incoming record is sunny, call it event x1;
if the incoming record is overcast, call it event x2;
if the incoming record is rainy, call it event x3.
The probability space of the random variable X is:
$P(X=x_1)=\frac{8}{14},\quad P(X=x_2)=\frac{4}{14},\quad P(X=x_3)=\frac{2}{14}$

(1) Conditional entropy for sunny weather: $H(Y|X=x_1) = -\frac{3}{8}\log_2\frac{3}{8} - \frac{5}{8}\log_2\frac{5}{8} = 0.9544$

(2) Conditional entropy for overcast weather: $H(Y|X=x_2) = -\frac{4}{4}\log_2\frac{4}{4} - \frac{0}{4}\log_2\frac{0}{4} = 0$ (taking $0\log_2 0 = 0$ by convention)

(3) Conditional entropy for rainy weather: $H(Y|X=x_3) = -\frac{2}{2}\log_2\frac{2}{2} - \frac{0}{2}\log_2\frac{0}{2} = 0$

(4) Conditional entropy of the weather: $H(Y|X) = \sum_{x\in X} p(x)H(Y|X=x) = p(x_1)H(Y|X=x_1) + p(x_2)H(Y|X=x_2) + p(x_3)H(Y|X=x_3) = \frac{8}{14}\times 0.9544 + \frac{4}{14}\times 0 + \frac{2}{14}\times 0 = 0.5454$

(5) Information gain of the weather: $g(Y,X) = H(Y) - H(Y|X) = 0.9403 - 0.5454 = 0.3949$

Information gain ratio

(1) Split (internal) information of the Day attribute: $IntI(D,Day) = 14 \times \left(-\frac{1}{14}\log_2\frac{1}{14}\right) = 3.8074$

(2) Split (internal) information of the Outlook attribute: $IntI(D,Outlook) = -\frac{8}{14}\log_2\frac{8}{14} - \frac{4}{14}\log_2\frac{4}{14} - \frac{2}{14}\log_2\frac{2}{14} = 1.3788$

(3) Information gain ratio of the Day attribute: $g(D|Day) = \frac{g(D,Day)}{IntI(D,Day)} = \frac{0.9403}{3.8074} = 0.247$ (Day uniquely identifies every record, so its conditional entropy is 0 and its information gain equals $H(Y)=0.9403$)

(4) Information gain ratio of the Outlook attribute: $g(D|Outlook) = \frac{g(D,Outlook)}{IntI(D,Outlook)} = \frac{0.3949}{1.3788} = 0.2864$

Although Day has a much larger information gain, its gain ratio is lower than Outlook's, which shows how the gain ratio penalizes attributes with many distinct values.
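The numbers above can be checked with a few lines of Python. The class splits per Outlook value (3/5 for sunny, 4/0 for overcast, 2/0 for rain) are taken from the conditional entropies computed above:

import numpy as np

def H(counts):
    p = np.asarray([c for c in counts if c > 0], dtype=float)
    p /= p.sum()
    return -(p * np.log2(p)).sum()

# class-count split for each Outlook value, as used in the conditional entropies above
outlook = {"sunny": (3, 5), "overcast": (4, 0), "rain": (2, 0)}
n = 14

H_Y = H((9, 5))                                                  # overall entropy, 0.9403
H_Y_given_X = sum(sum(c) / n * H(c) for c in outlook.values())   # conditional entropy, 0.5454
gain = H_Y - H_Y_given_X                                         # information gain, 0.3949

split_info = H([sum(c) for c in outlook.values()])               # IntI(D, Outlook), 1.3788
print(gain / split_info)                                         # gain ratio of Outlook, about 0.2864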

Gini value and Gini index

**Gini value Gini(D):** the probability that two samples drawn at random from data set D have different class labels. The smaller Gini(D) is, the higher the purity of data set D.

$Gini(D) = 1 - \sum_{k=1}^{|y|} p_k^2$

**Gini index Gini_index(D, a):** in general, the attribute that minimizes the Gini index after the split is chosen as the optimal splitting attribute.

$Gini\_index(D,a) = \sum_{v=1}^{V}\frac{|D_v|}{|D|}\,Gini(D_v)$
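A short Python sketch of these two formulas (assuming discrete class labels and a discrete splitting feature):

import numpy as np

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2 over the class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

def gini_index(labels, feature_values):
    """Weighted Gini of the partition induced by one discrete feature."""
    labels = np.asarray(labels)
    feature_values = np.asarray(feature_values)
    total = len(labels)
    return sum(
        (feature_values == v).sum() / total * gini(labels[feature_values == v])
        for v in np.unique(feature_values)
    )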
Case
Build a decision tree for the table below, using the Gini index as the splitting criterion.
[table image]
1. For the non-class-label attributes of the data set {owns a house, marital status, annual income}, compute the Gini gain of each, and take the attribute with the largest Gini gain as the root node of the decision tree.
2. The Gini index of the root node is:
[formula image]
3. When splitting on whether the person owns a house, the Gini gain is computed as:
[calculation image]
4. When splitting on the marital status attribute, the attribute has three possible values {married, single, divorced}, so the Gini gain is computed for each of the three binary groupings:
{married} | {single, divorced}
{single} | {married, divorced}
{divorced} | {single, married}
When the grouping is {married} | {single, divorced}:
[calculation image]
When the grouping is {single} | {married, divorced}:
[calculation image]
When the grouping is {divorced} | {single, married}:
[calculation image]
Compare the results: when splitting the root node on the marital status attribute, take the grouping with the largest Gini gain as the split, namely {married} | {single, divorced}.

5. The Gini gain for annual income is obtained in the same way:
Since annual income is a numeric attribute, its values are first sorted in ascending order, and then each midpoint between adjacent values is used as a candidate split point dividing the samples into two groups. For example, for the adjacent annual incomes 60 and 70, the midpoint is 65; using 65 as the split point, compute the Gini gain. (A code sketch of this procedure is given after this case.)
[calculation image]
Maximizing the Gini gain is equivalent to minimizing the weighted average of the impurity (Gini index) of the child nodes. According to the calculations, two of the three candidate attributes for the root node tie for the largest gain: annual income and marital status, both with a gain of 0.12. In that case the attribute that appears first is chosen for the first split.

6. Next, apply the same method to the remaining attributes and records. The Gini index of this node is (at this point there are 3 records for the "defaulted on loan" label):
[formula image]
7. For the attribute of whether the person owns a house:
[calculation image]
8. For the annual income attribute:
[calculation image]
[image]
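The midpoint procedure for numeric attributes described in step 5 can be sketched as follows. The income values and default labels here are hypothetical stand-ins, since the original table is only available as an image:

import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - (p ** 2).sum()

# Hypothetical (annual income, defaulted) pairs -- substitute the table's real values.
income = np.array([60, 70, 75, 85, 90, 95, 100, 120, 125, 220])
defaulted = np.array([0, 0, 0, 1, 1, 1, 0, 0, 0, 0])

order = np.argsort(income)                    # sort in ascending order of income
income, defaulted = income[order], defaulted[order]

parent = gini(defaulted)
best = None
for mid in (income[:-1] + income[1:]) / 2:    # midpoints of adjacent values
    left = defaulted[income <= mid]
    right = defaulted[income > mid]
    weighted = len(left) / len(defaulted) * gini(left) + len(right) / len(defaulted) * gini(right)
    gain = parent - weighted                  # Gini gain of this candidate split point
    if best is None or gain > best[1]:
        best = (mid, gain)
print(best)                                   # split point with the largest Gini gain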

Summary

1. Information gain
Information gain = entropy(before) - entropy(after)
Note: the larger the information gain, the higher the priority of the attribute when splitting.
Information gain tends to prefer attributes with more distinct values.
2. Information gain ratio
Maintains a split-information measure and uses it as a denominator to limit that bias.
3. Gini gain
● Gini value: the probability that two samples drawn at random from data set D have different class labels; the smaller Gini(D), the higher the purity of D.
● Gini index: choose the attribute with the smallest Gini index after splitting as the optimal splitting attribute.
● Gini gain: choose the split point with the largest Gini gain as the optimal split.

The three corresponding algorithms

[image: comparison of the three algorithms]

ID3 algorithm

Existing shortcomings:
(1) The ID3 algorithm uses information gain as the evaluation criterion when selecting the splitting attribute at the root node and at each internal node. The drawback of information gain is that it tends to prefer attributes with many distinct values, and in some cases such attributes provide little valuable information.

(2) The ID3 algorithm can only build decision trees for data sets whose descriptive attributes are discrete.

C4.5 algorithm

Improvements made (why C4.5 is better):
(1) Uses the information gain ratio to select attributes.
(2) Can handle continuous numeric attributes.
(3) Uses a post-pruning method.
(4) Can handle missing values.

Advantages and disadvantages of the C4.5 algorithm
Advantages:
The classification rules it produces are easy to understand and reasonably accurate.

Disadvantages:
While building the tree, the data set must be scanned and sorted repeatedly, which makes the algorithm inefficient.

In addition, C4.5 is only suitable for data sets that fit in memory; when the training set is too large to be held in memory, the program cannot run.

CART algorithm

Compared with the classification approach of C4.5, the CART algorithm uses a simplified binary-tree model, and feature selection uses the Gini index as an approximation that simplifies the computation.

C4.5 is not necessarily a binary tree, whereas CART must be a binary tree.

Also, whether it is ID3, C4.5 or CART, feature selection picks a single optimal feature to make the splitting decision. In many cases, however, the decision should not be determined by one feature but by a combination of features; a decision tree built that way is more accurate and is called a multivariate decision tree. When choosing the optimal split, a multivariate decision tree does not select a single best feature but the best linear combination of features. A representative algorithm is OC1, which is not covered here.

Finally, even a slight change in the samples can lead to drastic changes in the tree structure. This can be mitigated by ensemble methods such as random forests.

Classification tree API

API official website link

class sklearn.tree.DecisionTreeClassifier(*, criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, class_weight=None, ccp_alpha=0.0)

Parameters:
criterion — how impurity is computed:
Enter "entropy" to use information entropy (Entropy).
Enter "gini" to use the Gini coefficient (Gini impurity).

Compared with the Gini coefficient, information entropy is more sensitive to impurity and penalizes it more strongly. In practice, however, the two usually give essentially the same results. Computing information entropy is slower than computing the Gini coefficient, because the Gini coefficient involves no logarithms. Also, because information entropy is more sensitive to impurity, the tree grows in a more "fine-grained" way when entropy is used; for high-dimensional or noisy data, entropy therefore overfits easily and the Gini coefficient often works better. When the model underfits, that is, when it performs poorly on both the training set and the test set, information entropy is the better choice. Of course, none of this is absolute.

Parameter: criterion
How does it affect the model? It determines how impurity is computed, which guides the search for the best node and best split. The lower the impurity, the better the tree fits the training set.
How to choose it?
● Usually use the Gini coefficient.
● Use the Gini coefficient when the data dimension is very high or the data is very noisy.
● When the dimension is low and the data is relatively clean, there is little difference between information entropy and the Gini coefficient.
● When the decision tree underfits, use information entropy.
● Try both; if one does not work well, switch to the other.

random_state & splitter — randomness control
random_state is the random seed. Randomness is much more visible in high-dimensional data; in low-dimensional data (such as the iris data set) it barely shows.

splitter also controls the randomness of the decision tree and takes two values. With "best", the tree is still random when branching, but it preferentially splits on the more important features (importance can be viewed via the attribute feature_importances_). With "random", branching is more random: the tree tends to be deeper and larger because it contains more unnecessary splits, and the fit to the training set decreases because of them. This is also a way to prevent overfitting: if you expect your model to overfit, these two parameters can help reduce that possibility after the tree is built. Of course, once the tree is built, pruning parameters remain the main tool for controlling overfitting.

max_depth is the maximum depth of the decision tree; all branches beyond the set depth are pruned.
Each additional layer of the tree roughly doubles the demand for samples, so limiting the depth is an effective way to limit overfitting, and it is also very practical in ensemble algorithms. In practice, it is recommended to start from max_depth=3, check the fit, and then decide whether to increase the depth.

min_samples_leaf & min_samples_split — the minimum number of samples in a leaf node, and the minimum number of samples required to split an internal node.
min_samples_split limits when a subtree may continue to split: if a node has fewer than min_samples_split samples, it will not try to choose an optimal feature to split on. The default is 2. If the sample size is small, this value does not need tuning; if the sample size is very large, it is recommended to increase it. One previous project with about 100,000 samples used min_samples_split=10 when building the decision tree, which can serve as a reference.

min_samples_leaf limits the minimum number of samples in a leaf node: if a leaf ends up with fewer samples than this, it is pruned together with its sibling. The default is 1. You can pass an integer (minimum number of samples) or a float (minimum fraction of the total number of samples). If the sample size is small, this value does not need tuning; if the sample size is very large, it is recommended to increase it. The 100,000-sample project mentioned above used min_samples_leaf=5, also for reference only.

max_features & min_impurity_decrease
These are generally used together with max_depth to "fine-tune" the tree.

max_features limits the number of features considered when branching; features beyond the limit are discarded. Like max_depth, max_features is a pruning parameter used to limit overfitting on high-dimensional data, but its approach is more brute-force: it directly limits the number of features the tree is allowed to use and forcibly stops the tree. Without knowing how important each feature is to the model, forcing this parameter may cause the model to underfit. If you want to prevent overfitting through dimensionality reduction, it is better to use PCA, ICA, or the methods in the feature-selection module.

min_impurity_decrease limits the size of the information gain: branches whose information gain is less than the set value will not be made. This parameter was introduced in version 0.19; before 0.19, min_impurity_split was used.

class_weight & min_weight_fraction_leaf — class weighting parameters.
Attributes are the properties of the model that can be inspected after training. For decision trees the most important attribute is feature_importances_, which shows the importance of each feature to the model. The most commonly used interfaces of a decision tree are apply and predict: apply takes the test set and returns the index of the leaf node each test sample falls into; predict takes the test set and returns the predicted label of each test sample.

In all interfaces that take X_train or X_test, the input feature matrix must be at least two-dimensional; sklearn does not accept a one-dimensional array as a feature matrix. If your data really has only one feature, use **reshape(-1, 1)** to add a dimension; if it has only one feature and one sample, use reshape(1, -1) to add a dimension.
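For example, a minimal sketch of the reshaping described above:

import numpy as np

x = np.array([1.2, 3.4, 5.6])     # one feature, three samples
X = x.reshape(-1, 1)              # shape (3, 1): what sklearn expects

single = np.array([1.2])          # one feature, one sample
X_single = single.reshape(1, -1)  # shape (1, 1)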

Classification tree attribute list
[table image]
Classification tree interface list
[table image]
Summary:
The basic flow of the decision tree, the eight parameters of the classification tree, one attribute, four interfaces, and the code used for visualization.
Eight parameters: criterion; two randomness-related parameters (random_state, splitter); five pruning parameters (max_depth, min_samples_split, min_samples_leaf, max_features, min_impurity_decrease).
One attribute: feature_importances_.
Four interfaces: fit, score, apply, predict.

Red wine data set case

from sklearn import tree
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import graphviz  # graphviz must be installed beforehand

wine = load_wine()
X_train, X_test, Y_train, Y_test = train_test_split(wine.data, wine.target, test_size=0.3)
clf = tree.DecisionTreeClassifier(criterion="entropy")
clf = clf.fit(X_train, Y_train)
score = clf.score(X_test, Y_test)  # returns the prediction accuracy
print(score)
print(wine.feature_names)
print(wine.target_names)

feature_name = ['alcohol', 'malic acid', 'ash', 'alkalinity of ash', 'magnesium', 'total phenols',
                'flavonoids', 'non-flavonoid phenols', 'proanthocyanins', 'color intensity', 'hue',
                'od280/od315 of diluted wines', 'proline']

dot_data = tree.export_graphviz(clf
                                ,feature_names=feature_name  # feature names
                                ,class_names=["Gin", "Sherry", "Vermouth"]  # custom class labels, not from the dataset
                                ,filled=True   # fill nodes with color
                                ,rounded=True  # rounded node borders
                               )
graph = graphviz.Source(dot_data)
graph

[image: decision tree visualization]

# random_state & splitter
clf = tree.DecisionTreeClassifier(criterion="entropy"
                                  ,random_state=30
                                  ,splitter="random"
                                 )
clf = clf.fit(X_train, Y_train)
score = clf.score(X_test, Y_test)
print(score)

# pruning parameters
clf = tree.DecisionTreeClassifier(criterion="entropy"
                                  ,random_state=30
                                  ,splitter="random"
                                  ,max_depth=3
                                  ,min_samples_leaf=10   # prune leaf nodes with fewer than 10 samples
                                  ,min_samples_split=10  # prune internal nodes with fewer than 10 samples
                                 )
clf = clf.fit(X_train, Y_train)
dot_data = tree.export_graphviz(clf
                                ,feature_names=feature_name
                                ,class_names=["Gin", "Sherry", "Vermouth"]
                                ,filled=True
                                ,rounded=True
                               )
graph = graphviz.Source(dot_data)
graph

[image: pruned decision tree visualization]

# Find the best max_depth by drawing a learning curve
depth_scores = []

for dep in range(1, 10):
    clf = tree.DecisionTreeClassifier(criterion="entropy"
                                      ,max_depth=dep
                                      ,random_state=30
                                      ,splitter="random"
                                     )
    clf = clf.fit(X_train, Y_train)
    score = clf.score(X_test, Y_test)  # returns the accuracy
    depth_scores.append(score)
plt.plot(range(1, 10), depth_scores)

[image: accuracy vs. max_depth learning curve]

# Pair each feature name with its importance
dict(zip(wine.feature_names, clf.feature_importances_))
# Return the index of the leaf node each test sample falls into
clf.apply(X_test)

Regression tree API

class sklearn.tree.DecisionTreeRegressor(*, criterion='squared_error', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, ccp_alpha=0.0)

In regression trees, there is no question of whether the label distribution is balanced, so there is no parameter such as class_weight.

Parameter description:
criterion is the metric used to measure the quality of a split in a regression tree. Three criteria are supported:
1) "mse" uses the mean squared error (MSE). The difference in MSE between the parent node and the leaf nodes is used as the feature-selection criterion; this method minimizes the L2 loss by using the mean of each leaf node. (In recent sklearn versions this criterion is named "squared_error", as shown in the signature above.)

2) "friedman_mse" uses Friedman's mean squared error, a modified MSE for evaluating potential splits.

3) "mae" uses the mean absolute error (MAE), which minimizes the L1 loss by using the median of each leaf node. (In recent versions this is named "absolute_error".)

The most important attribute is still feature_importances_, and the core interfaces are still apply, fit, predict, and score.

The regression tree's score interface returns R-squared, not MSE.
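A minimal regression-tree sketch on synthetic data (not from the original notes; it assumes a recent sklearn where the criterion is named "squared_error"):

import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(200, 1), axis=0)        # one feature in [0, 5)
y = np.sin(X).ravel() + 0.1 * rng.randn(200)     # noisy sine curve

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
reg = DecisionTreeRegressor(criterion="squared_error", max_depth=4, random_state=0)
reg.fit(X_train, y_train)
print(reg.score(X_test, y_test))  # R-squared, not MSE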

Advantages and Disadvantages of Decision Trees

Decision tree advantages

  1. Easy to understand and explain because trees can be drawn and seen

  2. Requires little data preparation. Many other algorithms often require data normalization, creating dummy variables and removing null values, etc. But please note that the decision tree module in sklearn does not support the handling of missing values.

  3. The cost of using a tree (say, when predicting data) is the logarithm of the number of data points used to train the tree, which is a very low cost compared to other algorithms.

  4. Able to handle both numerical and categorical data, and can perform both regression and classification. Other techniques are often specialized for analyzing data sets with only one variable type.

  5. Able to handle multi-output problems, i.e., problems with multiple labels (note that this is distinct from multi-class problems, where a single label has multiple classes).

  6. It is a white box model and the results are easily interpretable. If a given situation can be observed in the model, the conditions can be easily explained through Boolean logic. In contrast, in black-box models (e.g., in artificial neural networks), the results may be more difficult to interpret.

  7. Models can be validated using statistical tests, which allow us to consider the reliability of the model.

  8. It can perform well even when its assumptions are to some extent violated by the true model that generated the data.

Disadvantages of decision trees

  1. Decision tree learners may create overly complex trees that do not generalize well to new data; this is called overfitting. Mechanisms such as pruning, setting the minimum number of samples required in a leaf node, or limiting the maximum depth of the tree are necessary to avoid this problem, and integrating and tuning these parameters can be obscure for beginners.

  2. Decision trees can be unstable, and small changes in the data can lead to completely different trees. This problem needs to be solved by ensemble algorithms.

  3. Decision tree learning is based on a greedy algorithm: it tries to reach the global optimum by optimizing locally (at each node), but this cannot guarantee a globally optimal tree. This problem can also be mitigated by ensemble methods; in random forests, features and samples are randomly sampled during splitting.

  4. Some concepts are difficult to learn because decision trees do not express them easily, such as XOR, parity, or the multiplexer problem.

  5. If certain classes in the label dominate, the decision tree learner builds trees biased toward the dominant classes. It is therefore recommended to balance the data set before fitting the decision tree.
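One way to compensate for class imbalance in sklearn is the class_weight parameter mentioned earlier; a minimal sketch on imbalanced synthetic data:

from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data: about 90% of samples belong to class 0
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=0)

# "balanced" re-weights classes inversely proportional to their frequencies
clf = DecisionTreeClassifier(class_weight="balanced", random_state=0)
clf.fit(X, y)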

Practice

Titanic passenger survival prediction

The titanic and titanic2 data frames describe the survival status of individual passengers on the Titanic. The data set used here was compiled by various researchers and includes passenger lists created by many researchers and edited by Michael A. Findlay. The features extracted from the data set are ticket class, survival, passenger class, age, embarked, home.dest, room, ticket, boat, and sex.

Data: http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt

After observing the data, we note:
1. pclass is the passenger class (1, 2, 3) and serves as a proxy for socio-economic status.
2. Some age values are missing.

import pandas as pd
import numpy as np
from sklearn import tree
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# 1. Load the data
titan = pd.read_csv("titanic.csv")

# 2. Basic data preparation
# 2.1 Select the features and the target
x = titan[["pclass", "age", "sex"]]
y = titan["survived"]

# 2.2 Handle missing values (assign instead of chained inplace fillna to avoid pandas warnings)
x = x.copy()
x["age"] = x["age"].fillna(x["age"].mean())

# 2.3 Split the data set
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)


# 3. Feature engineering (dict feature extraction)
# x is converted to a list of dicts with x.to_dict(orient="records"), e.g.
# [{"pclass": "1st", "age": 29.00, "sex": "female"}, ...]
transfer = DictVectorizer(sparse=False)
x_train = transfer.fit_transform(x_train.to_dict(orient="records"))
x_test = transfer.transform(x_test.to_dict(orient="records"))  # transform only: reuse the fitted vocabulary


# 4. Machine learning (decision tree)
estimator = DecisionTreeClassifier(criterion="entropy", max_depth=5)
estimator.fit(x_train, y_train)


# 5. Model evaluation
estimator.score(x_test, y_test)

estimator.predict(x_test)
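Continuing the script above, the fitted tree's feature_importances_ can be paired with the vectorizer's feature names to see which extracted features drive the predictions (get_feature_names_out assumes a recent sklearn; older versions use get_feature_names):

# Map each one-hot encoded feature back to its importance in the fitted tree
for name, importance in zip(transfer.get_feature_names_out(), estimator.feature_importances_):
    print(name, round(importance, 3))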


Source: blog.csdn.net/qq_45694768/article/details/120754428