Decision tree API, Titanic survival prediction case

1. Decision tree API

In sklearn, a decision tree is built with sklearn.tree.DecisionTreeClassifier(criterion='gini', max_depth=None, random_state=None)

The main parameters are listed below (a short usage sketch follows the list):

  • criterion
    • Feature selection criteria
    • "gini" or "entropy", the former represents the Gini coefficient, and the latter represents information gain. A default "gini", that is, the CART algorithm.
  • min_samples_split
    • Minimum number of samples required for subdividing internal nodes
    • This value limits whether a node may be split further: if a node contains fewer than min_samples_split samples, no attempt is made to find the best feature to split on. The default is 2. For small data sets this value can be left alone; for very large data sets it is recommended to increase it. In a previous project of mine with about 100,000 samples, I used min_samples_split=10 when building the decision tree, which can serve as a reference.
  • min_samples_leaf
    • Minimum number of samples for leaf nodes
    • This value sets the minimum number of samples a leaf node must contain. If a leaf ends up with fewer samples than this, it is pruned together with its sibling nodes. The default is 1. You can pass an integer (a minimum sample count) or a float (a minimum fraction of the total number of samples). For small data sets this value can be left alone; for very large data sets it is recommended to increase it. The 100,000-sample project mentioned above used min_samples_leaf=5, for reference only.
  • max_depth
    • Maximum depth of decision tree
    • By default no maximum depth is set, and the tree is grown without limiting the depth of its subtrees. This value can usually be ignored when there are few samples or features. If the model has a large sample size and many features, it is recommended to limit the maximum depth; the best value depends on the distribution of the data, with commonly used values between 10 and 100.
  • random_state
    • Random seed
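
A minimal sketch of how these parameters can be combined. The iris data set and the specific parameter values here are chosen purely for illustration, not taken from the case below:

from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Small built-in data set, used only to demonstrate the parameters
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=22)

# Limit tree growth with max_depth / min_samples_split / min_samples_leaf
clf = DecisionTreeClassifier(
    criterion="gini",        # or "entropy" for information gain
    max_depth=5,             # cap the depth of the tree
    min_samples_split=10,    # a node needs at least 10 samples to be split
    min_samples_leaf=5,      # every leaf must keep at least 5 samples
    random_state=22,         # fix the random seed for reproducibility
)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))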

2. Titanic survival prediction case

Code:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer


# Load the data
titan_data = pd.read_csv('./titan/train.csv')

# Preprocessing: pick the features and the target
x = titan_data[["Pclass", "Sex", "Age"]].copy()
y = titan_data["Survived"]
# Fill missing ages with the mean age
x['Age'] = x['Age'].fillna(x['Age'].mean())
# Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)

# Feature extraction: turn Pclass and Sex into numeric (one-hot) features
transfer = DictVectorizer(sparse=True)
x_train = transfer.fit_transform(x_train.to_dict(orient="records"))
# Only transform the test set, reusing the vocabulary fitted on the training set
x_test = transfer.transform(x_test.to_dict(orient="records"))

# Train the decision tree
estimator = DecisionTreeClassifier(criterion="entropy", max_depth=13)
estimator.fit(x_train, y_train)

# Evaluate on the test set
print(estimator.score(x_test, y_test))
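
After training, it can be useful to look at the rules the tree has learned. A small follow-up sketch, assuming the estimator and transfer objects from the code above, sklearn's export_text helper (available since 0.21), and get_feature_names_out (sklearn >= 1.0; older versions use get_feature_names):

from sklearn.tree import export_text

# Print the first few levels of the learned splitting rules,
# using the feature names produced by the DictVectorizer above
feature_names = transfer.get_feature_names_out()
print(export_text(estimator, feature_names=list(feature_names), max_depth=3))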


The training data set comes from the Kaggle platform: https://www.kaggle.com/c/titanic/overview
