Random Forest [Abstract of Machine Learning Notes]

In machine learning, a random forest is an ensemble classifier made up of multiple decision trees; its predicted class is the mode of the classes output by the individual trees.

Random Forest = Bagging + Decision Tree

Bagging ensemble principle

Bagging ensemble process
1. Sampling: draw a part of all the samples (with replacement)
2. Learning: train a weak learner on each sampled subset
3. Aggregation: combine the learners by equal-weight voting

Example: classify the following circles and squares (figures omitted).

Implementation process:
1. Sample different data sets
2. Train a classifier on each data set
3. Combine the classifiers by equal-weight voting to obtain the final result
4. Summary of the main implementation process
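The relationship "Random Forest = Bagging + Decision Tree" can be illustrated with a minimal sketch (an added illustration, not part of the original notes) that bags plain decision trees with scikit-learn's BaggingClassifier; a true random forest additionally subsamples features at every split. The synthetic dataset and parameter values are arbitrary choices.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy data: 1000 samples, 20 features (values chosen only for illustration)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging + decision tree: each tree is fit on a bootstrap sample of the training set,
# and the ensemble predicts by equal-weight majority voting.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=10, random_state=0)
bag.fit(X_train, y_train)
print(bag.score(X_test, y_test))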

Random forest construction process

For example, if you train 5 trees and 4 of them output True while 1 outputs False, the final voting result is True.

The key steps in the random forest construction process (N denotes the number of training samples, M the number of features):
1) Randomly draw one sample at a time, with replacement, and repeat N times (duplicate samples may appear in a tree's training set).

2) Randomly select m features, with m << M, and build a decision tree.
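A minimal NumPy sketch of these two steps (an added illustration, not from the original notes; N, M, and m are chosen only for demonstration):

import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 16                      # number of samples and number of features
m = int(np.sqrt(M))                  # m << M, e.g. sqrt(M) features per tree

# 1) bootstrap: draw N sample indices with replacement
sample_idx = rng.integers(0, N, size=N)
# 2) randomly choose m of the M features for this tree
feature_idx = rng.choice(M, size=m, replace=False)
# a decision tree would then be trained on X[sample_idx][:, feature_idx]
# (note: scikit-learn's random forest draws the feature subset at every split,
# controlled by max_features, rather than once per tree)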

  • Think
    • 1. Why randomly sample the training set?
      If there were no random sampling and every tree were trained on the same training set, all the trees would produce exactly the same classification results.

    • 2. Why sample with replacement?
      If sampling were done without replacement, the training samples of the trees would be disjoint, so every tree would be "biased" and completely "one-sided" (admittedly this wording is loose); that is, the trees would differ greatly after training. Since the final classification of a random forest depends on the votes of many trees (weak classifiers), the vote is more reliable when the trees are trained on overlapping, but not identical, samples.
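A quick numerical check (an added illustration, not from the original notes): when N samples are drawn with replacement, each bootstrap set contains roughly 1 - (1 - 1/N)^N ≈ 1 - 1/e ≈ 63.2% of the distinct samples, so the trees' training sets overlap but are not identical.

import numpy as np

rng = np.random.default_rng(0)
N = 10000
bootstrap = rng.integers(0, N, size=N)           # draw N indices with replacement
unique_fraction = np.unique(bootstrap).size / N
print(unique_fraction)                           # about 0.632, close to 1 - 1/e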

Random forest API

scikit-learn provides RandomForestClassifier for random forest classification and RandomForestRegressor for random forest regression. RandomForestClassifier is introduced here.

class sklearn.ensemble.RandomForestClassifier(n_estimators=100, *, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0, bootstrap=True, oob_score=False, n_jobs=None, random_state=None, verbose=0, warm_start=False, class_weight=None, ccp_alpha=0.0, max_samples=None)
  • n_estimators: integer, optional (default = 100) The number of trees in the forest;

  • criterion: string, optional (default = "gini") The impurity measure; there are two options: Gini impurity ("gini") and information entropy ("entropy")

  • max_depth: integer or None, optional (default = None) The maximum depth of the tree;

  • max_features="auto": limits the number of features considered when looking for the best split; the default is the square root of the total number of features.

    • If “auto”, then max_features=sqrt(n_features).
    • If “sqrt”, then max_features=sqrt(n_features) (same as “auto”).
    • If “log2”, then max_features=log2(n_features).
    • If None, then max_features=n_features.
  • bootstrap: boolean, optional (default=True) Whether to use sampling with replacement when building trees

  • min_samples_split: The minimum number of samples required to split an internal node

  • min_samples_leaf: The minimum number of samples required at a leaf node

  • Key hyperparameters to tune: n_estimators, max_depth, min_samples_split, min_samples_leaf
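A minimal usage sketch of these hyperparameters (added for illustration; the iris dataset and the parameter values are arbitrary choices, not from the original notes). The full Titanic example below then adds grid search on top of this.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators, max_depth, min_samples_split and min_samples_leaf are the usual knobs to tune
rf = RandomForestClassifier(n_estimators=200, max_depth=8,
                            min_samples_split=2, min_samples_leaf=1,
                            random_state=0)
rf.fit(X_train, y_train)
print(rf.score(X_test, y_test))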

Example

import pandas as pd
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

# 1. Load the Titanic dataset
titan = pd.read_csv('titanic.csv')

# 2. Basic data preprocessing
# 2.1 Select the feature columns and the target column
x = titan[['pclass', 'age', 'sex']]
y = titan['survived']

# 2.2 Handle missing values (copy first to avoid pandas chained-assignment warnings)
x = x.copy()
x['age'] = x['age'].fillna(x['age'].mean())

# 2.3 Split into training and test sets
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)


# 3. Feature engineering (dict feature extraction)
# Convert x to a list of dicts with x.to_dict(orient="records")
transfer = DictVectorizer(sparse=False)
x_train = transfer.fit_transform(x_train.to_dict(orient='records'))
x_test = transfer.transform(x_test.to_dict(orient='records'))  # only transform the test set, do not refit


# 4. Machine learning (random forest with grid search)
estimator = RandomForestClassifier()
param_grid = {
    "n_estimators": [120, 200, 300, 500, 800, 1200],
    "max_depth": [5, 8, 15, 25, 30],
}
estimator = GridSearchCV(estimator, param_grid=param_grid, cv=3)

estimator.fit(x_train, y_train)

# 5. Model evaluation
score = estimator.score(x_test, y_test)
print("Accuracy on the test set:\n", score)
print("Best cross-validation score:\n", estimator.best_score_)
print("Best parameters found:\n", estimator.best_params_)
print("Best estimator:\n", estimator.best_estimator_)
print("Cross-validation results:\n", estimator.cv_results_)

Random forest regression to fill missing values

The sklearn.impute.SimpleImputer class makes it easy to fill missing values with the mean, the median, or another common value. Next we fill the missing values in the Boston housing price dataset in three ways — with the mean, with the constant 0, and with random forest regression — and compare the fit in each case to find the best imputation method.

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
from sklearn.datasets import load_boston   # removed in scikit-learn 1.2; requires an older version
from sklearn.impute import SimpleImputer   # for filling missing values
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Load the dataset --- 506 * 13 = 6578 values in total
boston = load_boston()
x_full = boston.data          # feature matrix
y_full = boston.target        # target column
n_samples = x_full.shape[0]   # 506 rows
n_features = x_full.shape[1]  # 13 columns (features)

Construct missing values
1. First determine the proportion of missing values: 50%, i.e. a total of 3289 missing entries.

rng = np.random.RandomState(0)  # random seed
missing_rate = 0.5              # proportion of values to make missing
n_missing_samples = int(np.floor(n_samples * n_features * missing_rate))  # np.floor() rounds down and returns a float, so cast to int
n_missing_samples  # 3289

2. The missing values should be scattered across the 506*13 table: 3289 missing entries are generated at random (row, column) positions. As with a DataFrame, we locate each cell by its (row, column) index to create the missing value.

missing_samples  = rng.randint(0, n_samples, n_missing_samples)   # draw 3289 random row indices
missing_features = rng.randint(0, n_features, n_missing_samples)  # draw 3289 random column indices
# Sampling this way draws far more row indices than the 506 samples, so rows will repeat.
# Alternatively, np.random.choice() can draw non-repeating random indices, which keeps the missing
# cells from concentrating in the same rows and, to some extent, guarantees better dispersion.
missing_features
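A possible sketch of the np.random.choice() alternative mentioned in the comment above (an added illustration, not the original author's code): drawing cell positions without replacement guarantees that no (row, column) pair is hit twice.

# Draw 3289 distinct cell positions out of the 506*13 = 6578 cells,
# then convert the flat positions back to (row, column) indices.
flat_positions = rng.choice(n_samples * n_features, n_missing_samples, replace=False)
missing_samples_alt  = flat_positions // n_features  # row indices
missing_features_alt = flat_positions %  n_features  # column indices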

3. Generate missing values

x_missing = x_full.copy()  # copy the original feature matrix
x_missing[missing_samples, missing_features] = np.nan  # place NaN at the randomly chosen (row, column) positions
x_missing = pd.DataFrame(x_missing)
x_missing

Missing value filling
① Mean filling
Use the SimpleImputer class from sklearn.impute. missing_values=np.nan specifies what counts as a missing value (NaN); strategy='mean' means missing values are filled with the column mean.

# ① Fill with the mean
imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
x_missing_mean = imp_mean.fit_transform(x_missing)  # fit() + transform() ==> fit_transform()
x_missing_mean = pd.DataFrame(x_missing_mean)
x_missing_mean

② Filling with 0
strategy='constant', fill_value=0 means filling with a constant, and fill_value specifies that the constant is 0.

imp_0 = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0)
x_missing_0 = imp_0.fit_transform(x_missing)  # fit() + transform() ==> fit_transform()
x_missing_0 = pd.DataFrame(x_missing_0)
x_missing_0

③ Filling with random forest regression
Any regression learns from a feature matrix in order to predict a continuous label y; this works because the regression algorithm assumes some relationship between the features and the label. In fact, features and labels are interchangeable. For example, in a problem that predicts house price from area, environment, and the number of nearby schools, we can use "area", "environment", and "number of nearby schools" to predict "house price", or conversely use "environment", "number of nearby schools", and "house price" to predict "area" (somewhat like a y = kx + b equation: knowing three quantities lets you solve for the fourth). Regression imputation of missing values uses exactly this idea.

For a dataset with n features where feature T has missing values, we treat feature T as the label, and the other n-1 features plus the original label form a new feature matrix. The non-missing part of T has both "features" and "label" and serves as our training data; the missing part has "features" but no "label" and is what we need to predict.

The other n-1 features plus the original label, for rows where T is not missing: xtrain
The non-missing values of feature T: ytrain

The other n-1 features plus the original label, for rows where T is missing: xtest
The missing values of feature T: unknown — this is the ytest we need to predict

This approach is well suited to situations where one feature has many missing values but the other features are complete.

What if, besides feature T, other features in the data also have missing values?

The answer is to iterate over all features, starting from the one with the fewest missing values (filling the feature with the fewest missing values requires the least information from the others). While filling one feature, the missing values of the other features are temporarily replaced with 0. Each time a regression prediction is finished, the predicted values are written back into the original feature matrix before moving on to the next feature. With every completed feature, the number of features that still contain missing values drops by one, so fewer and fewer cells need to be filled with 0 in later iterations. By the time we reach the last feature (the one with the most missing values among all features), no other features need 0-filling, and a large amount of useful information has already been filled in by regression, which can then be used to impute the feature with the most missing values.
(1) Sort the features by their number of missing values

x_missing_reg = x_missing.copy()
# Order of the features sorted by missing-value count, from fewest to most
# np.argsort() --- returns the indices that would sort the array in ascending order
sort_columns_index = np.argsort(x_missing_reg.isnull().sum()).values
sort_columns_index

(2) Iterate over the sorted indices and fill the missing values

for i in sort_columns_index:

    # Build the new feature matrix (all features except the one being filled + the original target)
    # and the new label (the feature selected for filling)
    df = x_missing_reg
    fillc = df.iloc[:, i]  # the feature column to be filled --- the new label
    df = pd.concat([df.iloc[:, df.columns != i], pd.DataFrame(y_full)], axis=1)  # the other n-1 columns plus the complete target

    # In the new feature matrix, fill the remaining missing values with 0
    df_0 = SimpleImputer(missing_values=np.nan, strategy='constant', fill_value=0).fit_transform(df)

    # Split into training and test sets
    ytrain = fillc[fillc.notnull()]  # non-missing values of the selected feature --- training labels
    ytest  = fillc[fillc.isnull()]   # missing values of the selected feature --- the "labels" to predict
    xtrain = df_0[ytrain.index, :]   # rows of the new matrix where the selected feature is not missing
    xtest  = df_0[ytest.index, :]    # rows of the new matrix where the selected feature is missing

    # Use random forest regression to fill the missing values
    rfc = RandomForestRegressor(n_estimators=100).fit(xtrain, ytrain)
    y_predict = rfc.predict(xtest)

    # Write the predicted values back into the original feature matrix
    x_missing_reg.loc[x_missing_reg.iloc[:, i].isnull(), i] = y_predict

④ Evaluate the imputation results
We now use cross-validation with the mean squared error to score the original dataset, the mean-filled dataset, the 0-filled dataset, and the random-forest-regression-filled dataset.

# Evaluate the imputation methods
X = [x_full, x_missing_mean, x_missing_0, x_missing_reg]
mse = []  # evaluate with the mean squared error

for x in X:
    estimator = RandomForestRegressor(n_estimators=100, random_state=0)
    scores = cross_val_score(estimator, x, y_full, scoring='neg_mean_squared_error', cv=5).mean()
    mse.append(scores * -1)
mse

[21.571667100368845, 42.62658760318384, 42.62658760318384, 17.52358682764511]

The evaluation shows that filling with the mean or with 0 gives a mean squared error above 40, whereas filling with random forest regression fits even better than the original dataset, with an MSE as low as about 17.5. Of course, overfitting cannot be ruled out.

# Visualization
plt.figure(figsize=(12, 8))                 # figure
colors = ['r', 'g', 'b', 'orange']          # bar colors
x_labels = ["x_full", "x_missing_mean", "x_missing_0", "x_missing_reg"]  # labels

ax = plt.subplot(111)                       # add a subplot
for i in range(len(mse)):
    ax.barh(i, mse[i], color=colors[i], alpha=0.6, align='center')

ax.set_title('Imputation Technique with Boston Data')          # title
ax.set_xlim(left=np.min(mse) * 0.9, right=np.max(mse) * 1.1)   # x-axis range
ax.set_yticks(range(len(mse)))                                  # one tick per method
ax.set_xlabel("MSE")                                            # x-axis label
ax.set_yticklabels(x_labels)                                    # y-axis tick labels

plt.show()

[Figure: horizontal bar chart of the MSE for each imputation technique]

Tuning parameters

For tree models, the more luxuriant the tree — the deeper it grows and the more branches and leaves it has — the more complex the model. The tree model therefore naturally sits toward the high-complexity end of the complexity–generalization curve, and random forest, being built on trees, also has high inherent complexity. The parameters of a random forest are all aimed at one goal: reducing the model's complexity, moving the model toward the simpler side of that curve and preventing overfitting. Of course, there is no absolute rule for parameter tuning.
So how does each parameter affect complexity and the model? In practice we adjust parameters one at a time, searching for the optimal value on a learning curve, hoping to push accuracy to a relatively high level (a sketch of such a learning curve follows the parameter list below). Knowing that the general direction of random forest tuning is to reduce complexity, we can select the parameters that have the greatest influence on complexity, study their monotonicity, and focus on those that reduce complexity the most. Parameters that are not monotonic, or that increase complexity, are used as the situation requires, and in most cases can be left alone. Based on experience, the parameters are ranked below by their influence on the model; you can refer to this order when tuning.

Parameter — impact on model performance on unknown data — influence level:

  • n_estimators: performance improves until it plateaus as n_estimators increases; it does not affect the complexity of a single tree. Influence: ⭐⭐⭐⭐
  • max_depth: can raise or lower performance. The default (no limit) gives the highest complexity; lowering max_depth reduces complexity, making the model simpler and moving it toward the simple side of the curve. Influence: ⭐⭐⭐
  • min_samples_leaf: can raise or lower performance. The default minimum of 1 gives the highest complexity; raising min_samples_leaf reduces complexity, making the model simpler. Influence: ⭐⭐
  • min_samples_split: can raise or lower performance. The default minimum of 2 gives the highest complexity; raising min_samples_split reduces complexity, making the model simpler. Influence: ⭐⭐
  • max_features: can raise or lower performance. The default "auto" (square root of the total number of features) sits at medium complexity; lowering max_features makes the model simpler, raising it makes the model more complex. max_features is the only parameter that can make the model either simpler or more complex, so the tuning direction must be chosen deliberately.
  • criterion: can raise or lower performance; gini is generally used. Influence: depends on the specific situation
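As mentioned above, parameters are usually tuned one at a time on a learning curve. A minimal sketch of such a curve for n_estimators (an added illustration; the breast-cancer dataset and the range of values are arbitrary choices, not from the original notes):

import matplotlib.pyplot as plt
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

scores = []
n_range = range(10, 201, 10)
for n in n_range:
    rf = RandomForestClassifier(n_estimators=n, random_state=0)
    scores.append(cross_val_score(rf, X, y, cv=5).mean())  # mean CV accuracy for this n_estimators

plt.plot(list(n_range), scores)
plt.xlabel("n_estimators")
plt.ylabel("cross-validated accuracy")
plt.show()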


Origin blog.csdn.net/qq_45694768/article/details/120878004