Machine Learning: Decision Trees and Random Forests

1. Decision tree

1.1 The basis for splitting in a decision tree

Information gain

Information entropy:

H(D) = − Σ_{k=1}^{K} (|C_k| / |D|) · log₂(|C_k| / |D|)

where C_k is the set of samples in D that belong to class k, and |D| is the total number of samples.

Information gain

The information gain g(D, A) of feature A with respect to the training set D is defined as the difference between the information entropy H(D) of the set D and the conditional entropy H(D|A) of D given feature A.
g(D, A) = H(D) − H(D|A)

Calculation of the conditional entropy:

H(D|A) = Σ_{i=1}^{n} (|D_i| / |D|) · H(D_i)

where D_i is the subset of D on which feature A takes its i-th value.
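To make the formulas concrete, here is a minimal sketch that computes the entropy of a label column and the information gain of a single feature on a small hand-made table; the toy column names pclass and survived and the values are purely illustrative.

import numpy as np
import pandas as pd

def entropy(labels):
    # H(D) = -sum_k p_k * log2(p_k), where p_k = |C_k| / |D|
    probs = labels.value_counts(normalize=True)
    return -np.sum(probs * np.log2(probs))

def information_gain(df, feature, target):
    # g(D, A) = H(D) - H(D|A)
    h_d = entropy(df[target])
    # H(D|A) = sum_i (|D_i| / |D|) * H(D_i)
    h_d_a = sum(
        (len(subset) / len(df)) * entropy(subset[target])
        for _, subset in df.groupby(feature)
    )
    return h_d - h_d_a

# Toy data: how much does passenger class tell us about survival?
toy = pd.DataFrame({
    "pclass": ["1st", "1st", "3rd", "3rd", "3rd", "2nd"],
    "survived": [1, 1, 0, 0, 1, 1],
})
print(information_gain(toy, "pclass", "survived"))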

1.2 Use of Decision Tree Estimator

"""
决策树对泰坦尼克号进行预测生死
:return: None
"""
# 获取数据
titan = pd.read_csv("http://biostat.mc.vanderbilt.edu/wiki/pub/Main/DataSets/titanic.txt")

# 处理数据,找出特征值和目标值
x = titan[['pclass', 'age', 'sex']]
y = titan['survived']
print(x)

# 缺失值处理
x['age'].fillna(x['age'].mean(), inplace=True)

# 分割数据集到训练集合测试集
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25)

 # 因为原数据类型太杂了,有字符串啥的,处理成字典类型
dict = DictVectorizer(sparse=False)
x_train = dict.fit_transform(x_train.to_dict(orient="records"))
print(dict.get_feature_names())
x_test = dict.transform(x_test.to_dict(orient="records"))
print(x_train)

#用决策树进行预测
dec = DecisionTreeClassifier()
dec.fit(x_train, y_train)

# 预测准确率
print("预测的准确率:", dec.score(x_test, y_test))

# 导出决策树的结构
export_graphviz(dec, out_file="./decision_tree.dot", feature_names=['年龄', 'pclass=1st', 'pclass=2nd', 'pclass=3rd', '女性', '男性'])
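The exported decision_tree.dot file can then be rendered to an image with the Graphviz command-line tool (assuming Graphviz is installed), for example with dot -Tpng decision_tree.dot -o decision_tree.png, which makes it easy to inspect the learned splits.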

1.3 Advantages and disadvantages of decision trees

Advantages: the tree structure is easy to visualize and interpret, and it requires little data preparation.
Disadvantages: overly complex trees tend to overfit the training data; this can be mitigated by limiting the tree depth (pruning) or by using ensemble methods such as random forests.

2. Random forest

2.1 How a random forest works

Principle

In machine learning, a random forest is a classifier made up of multiple decision trees; its output class is the mode of the classes predicted by the individual trees. Multiple independent classification models are built from the same training data, and the final prediction is made by majority vote. For example, if 5 trees are trained and 4 of them predict True while 1 predicts False, the final result is True.

Procedure

Let N denote the number of training examples (samples) and M the number of features.
Choose a number of features m to be considered when deciding a node of a decision tree, where m should be much smaller than M.
From the N training examples, sample N times with replacement to form the training set of each tree (bootstrap sampling), and use the examples that were not selected to estimate the prediction error.
At each node, randomly select m features and compute the best split among them; the decision at each node of the tree is based only on these m features. A minimal sketch of this bootstrap-and-voting procedure follows the list.
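The sketch below illustrates the procedure using scikit-learn's DecisionTreeClassifier as the base learner; the class name SimpleForest and its parameters are made up for illustration (it assumes integer class labels), and the production implementation is the RandomForestClassifier shown in section 2.2.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

class SimpleForest:
    """Toy random forest: bootstrap sampling + random features per split + majority vote."""

    def __init__(self, n_estimators=10, max_features=2, random_state=0):
        self.n_estimators = n_estimators
        self.max_features = max_features  # m: features considered at each split (m << M)
        self.rng = np.random.default_rng(random_state)
        self.trees = []

    def fit(self, X, y):
        X, y = np.asarray(X), np.asarray(y)
        n_samples = X.shape[0]
        for _ in range(self.n_estimators):
            # Bootstrap sampling: draw N examples with replacement
            rows = self.rng.integers(0, n_samples, size=n_samples)
            # max_features makes the tree pick m random features at each node
            tree = DecisionTreeClassifier(max_features=self.max_features)
            self.trees.append(tree.fit(X[rows], y[rows]))
        return self

    def predict(self, X):
        # Each tree votes; the majority class (mode) wins
        votes = np.stack([tree.predict(np.asarray(X)) for tree in self.trees])
        return np.apply_along_axis(
            lambda v: np.bincount(v.astype(int)).argmax(), axis=0, arr=votes)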

2.2 Random forest parameters and hyperparameters

class sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion='gini',
    max_depth=None, bootstrap=True, random_state=None)

Random forest classifier

n_estimators: integer, optional (default=10) — the number of trees in the forest

criterion: string, optional (default="gini") — the function used to measure the quality of a split

max_depth: integer or None, optional (default=None) — the maximum depth of a tree

bootstrap: boolean, optional (default=True) — whether to use sampling with replacement when building trees
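As a quick illustration of these parameters, a forest can be configured and fitted directly before any hyper-parameter tuning; this sketch reuses the x_train, y_train, x_test, y_test prepared in section 1.2.

from sklearn.ensemble import RandomForestClassifier

# A forest of 200 trees, each at most 8 levels deep, built on bootstrap samples
rf = RandomForestClassifier(n_estimators=200, criterion="gini",
                            max_depth=8, bootstrap=True, random_state=42)
rf.fit(x_train, y_train)
print("Test accuracy:", rf.score(x_test, y_test))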

2.3 Use of Random Forest & Grid Search

# Random forest prediction with hyper-parameter tuning
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier()
param = {"n_estimators": [120, 200, 300, 500, 800, 1200],
         "max_depth": [5, 8, 15, 25, 30]}

# Grid search with cross-validation
gc = GridSearchCV(rf, param_grid=param, cv=2)
gc.fit(x_train, y_train)
print("Accuracy:", gc.score(x_test, y_test))
print("Best parameters found:", gc.best_params_)

2.4 Advantages

It achieves excellent accuracy among current algorithms
It runs efficiently on large data sets
It can handle input samples with high-dimensional features without dimensionality reduction
It can evaluate the importance of each feature for the classification problem (see the short example after this list)
It also obtains good results on problems with missing values
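For the feature-importance point, a fitted forest exposes a feature_importances_ attribute. A small sketch, assuming the grid search gc from section 2.3 and the DictVectorizer vec from section 1.2, could look like this:

# The best forest found by the grid search (refitted on the full training set)
best_rf = gc.best_estimator_

# Pair each one-hot encoded feature name with its importance score
for name, score in zip(vec.get_feature_names(), best_rf.feature_importances_):
    print(name, round(score, 3))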

Original post: https://blog.csdn.net/tjjyqing/article/details/114044163