[Python] random forest prediction


Foreword

Building multiple decision trees and merging them to obtain a more accurate and stable model combines the idea of bagging with random feature selection. A random forest constructs many decision trees; when a sample needs to be predicted, each tree in the forest makes its own prediction, and the final result is chosen by majority vote (for classification) or by averaging the trees' outputs (for regression).
Randomness enters in two main ways:

1. Features are sampled randomly at each split
2. Training samples are drawn randomly (with replacement), so the trees in the forest are similar to one another yet different
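
To make the averaging concrete, here is a minimal sketch on synthetic data (the data and parameter values are illustrative, not from the original article) confirming that a random forest regressor's output is the mean of its trees' predictions:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=200, n_features=5, random_state=0)
rf = RandomForestRegressor(n_estimators=10, random_state=0).fit(X_demo, y_demo)

# Average the per-tree predictions manually...
tree_preds = np.stack([t.predict(X_demo) for t in rf.estimators_])
manual_mean = tree_preds.mean(axis=0)

# ...and confirm it matches the ensemble's own prediction.
print(np.allclose(manual_mean, rf.predict(X_demo)))  # True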


1. Why use RF

1. Advantages:

  1. High accuracy and efficient training (the trees can be trained in parallel)

  2. Can handle high-dimensional features without dimensionality reduction

  3. Provides a way to measure feature importance

  4. Uses an internal unbiased estimate of generalization error while the trees are being built (the out-of-bag estimate; see the sketch after this list)

  5. Has good algorithms for dealing with missing values

  6. Can balance errors on class-imbalanced data

  7. Can measure similarity between samples, and use that similarity to cluster samples and filter outliers

  8. Offers an empirical way to measure feature interactions (it copes well when the data contains redundant features)

  9. Can be extended to unsupervised learning

  10. Makes it easy to check model accuracy (e.g., with an ROC curve)

The above advantages are drawn from published summaries and personal opinion.
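
On advantage 4: scikit-learn exposes the out-of-bag estimate through the oob_score parameter. A minimal sketch on synthetic data (parameter values are illustrative):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=5.0, random_state=0)

# oob_score=True evaluates each tree on the bootstrap samples
# it did not see during training.
rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
rf.fit(X_demo, y_demo)
print("OOB R^2 estimate:", rf.oob_score_)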

2. Disadvantages:

  1. It is a black box with poor interpretability; the multiple layers of randomness make its results hard to trace
  2. Can overfit on some noisy classification and regression problems
  3. The model can be very large: higher accuracy generally requires more data and more trees
  4. A large number of decision trees is generated, which makes further analysis cumbersome

2. Usage steps

1. Import libraries

The code is as follows (example):

import os

import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import export_graphviz

# The imports below are only needed for the optional tree-to-PNG path
# in step 8 (rendering requires the Graphviz system package).
from six import StringIO
from IPython.display import Image
import pydotplus

2. Read data

The code is as follows (example):

data_train = pd.read_excel("data_train.xlsx")
# data_train.describe().to_excel('data_train_describe.xlsx')
# Descriptive statistics of the data
print(data_train.describe())
# Check data completeness and column types
print(data_train.info())

This organizes the data and analyzes it descriptively with basic statistics.


3. Handling missing values

# Number of missing values per column
total = data_train.isnull().sum().sort_values(ascending=False)
# Proportion of missing values per column
percent = (data_train.isnull().sum() / data_train.isnull().count()).sort_values(ascending=False)
print(total)
print(percent)

The code above computes missing-value statistics; the code below fills the missing values.

# Fill missing values
data_train['x1'] = data_train['x1'].fillna(0)
print(data_train.isnull().sum().max())

4. Handling dummy (categorical) variables

# Dummy-variable handling
# (Label encoding would be an alternative here:
#    data_train.loc[data_train['x10'] == '类别1', 'x10'] = 1
#    data_train.loc[data_train['x10'] == '类别2', 'x10'] = 2
#  but we one-hot encode instead, so the new columns keep the
#  category names, e.g. x10_类别1, as used in step 9.)
a = pd.get_dummies(data_train['x10'], prefix="x10")
frames = [data_train, a]
data_train = pd.concat(frames, axis=1)
data_train = data_train.drop(columns=['x10'])
data_train.to_excel('data_train_yucl.xlsx')
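
One caveat with pd.get_dummies that the steps above do not cover: a new batch of data may be missing some categories seen during training, so its dummy columns must be aligned with the training columns before prediction. A minimal self-contained sketch (the toy series here are illustrative):

# Training-time categories: 类别1 and 类别2; the new batch only has 类别1.
train_x10 = pd.Series(['类别1', '类别2', '类别1'])
new_x10 = pd.Series(['类别1', '类别1'])

train_dummies = pd.get_dummies(train_x10, prefix='x10')
new_dummies = pd.get_dummies(new_x10, prefix='x10')

# Align the new batch to the training columns; missing categories get 0.
new_dummies = new_dummies.reindex(columns=train_dummies.columns, fill_value=0)
print(new_dummies)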

5. Feature variables

# Scatter plot of the relationship between feature x1 and target y
var = 'x1'
data = pd.concat([data_train['y'], data_train[var]], axis=1)
data.plot.scatter(x=var, y='y')
plt.show()

# Scatter plot of the relationship between feature x5 and target y
var0 = 'x5'
data0 = pd.concat([data_train['y'], data_train[var0]], axis=1)
data0.plot.scatter(x=var0, y='y')
plt.show()


# Split into feature data (X) and target data (y)
X = data_train.drop(columns=['y'])
y = data_train['y']

Analyzing the feature data helps reveal the importance of each variable. The essence of feature selection is to measure the quality of a given feature subset against a specific evaluation criterion: redundant and irrelevant features are removed from the original feature set, while useful features are preserved. With the same model and the same data, the choice of feature variables can still have a huge impact on the results, which is why different feature subsets should be examined in different settings.

(Figures: scatter plots of x1 and x5 against y)
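
As one concrete, importance-based way to carry out such feature selection, scikit-learn's SelectFromModel can drop features scored below a threshold by a forest. A minimal sketch, assuming the X and y defined above (the threshold choice is illustrative):

from sklearn.feature_selection import SelectFromModel

# Fit a forest purely to score the features.
selector = SelectFromModel(
    RandomForestRegressor(n_estimators=100, random_state=1),
    threshold='median')  # keep features above the median importance
selector.fit(X, y)
print("Kept features:", list(X.columns[selector.get_support()]))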

6. Modeling

# Split the data into training and validation sets
# (the 80/20 split ratio is illustrative)
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, y, test_size=0.2, random_state=1)

# Build the model
forest = RandomForestRegressor(
    n_estimators=100,
    random_state=1,
    n_jobs=-1)
forest.fit(X_train, Y_train)

score = forest.score(X_validation, Y_validation)
print('Random forest model score:', score)
y_validation_pred = forest.predict(X_validation)

Calling RandomForestRegressor like this gives us a model; of course, this is only the most basic modeling, and in practice the hyperparameters would usually be tuned as well.
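
For instance, hyperparameters can be tuned with GridSearchCV; a minimal sketch (the parameter grid is illustrative, not from the original article):

from sklearn.model_selection import GridSearchCV

param_grid = {
    'n_estimators': [100, 300],
    'max_depth': [None, 10, 20],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=1, n_jobs=-1),
    param_grid, cv=5)
search.fit(X_train, Y_train)
print(search.best_params_, search.best_score_)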

7. Comparing results on the validation set

# Plot validation-set results for comparison
# (guard against the validation set having fewer than 1000 samples)
n = min(1000, len(Y_validation))
plt.figure()
plt.plot(np.arange(n), Y_validation.values[:n], "go-", label="True value")
plt.plot(np.arange(n), y_validation_pred[:n], "ro-", label="Predict value")
plt.title("True value And Predict value")
plt.legend()
plt.show()

(Figure: true vs. predicted values on the validation set)
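
Beyond the visual comparison, quantitative error metrics sharpen the picture. A minimal sketch using sklearn.metrics on the validation predictions above:

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

print("MAE :", mean_absolute_error(Y_validation, y_validation_pred))
print("MSE :", mean_squared_error(Y_validation, y_validation_pred))
print("R^2 :", r2_score(Y_validation, y_validation_pred))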

8. Decision tree

# Export one decision tree from the forest to a .dot file
with open('./wine.dot', 'w', encoding='utf-8') as f:
    export_graphviz(forest.estimators_[0], out_file=f,
                    feature_names=list(X_train.columns))
# Optional: render the tree to a PNG instead (requires Graphviz):
# dot_data = StringIO()
# export_graphviz(forest.estimators_[0], out_file=dot_data)
# graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
# graph.write_png('tree.png')
# Image(graph.create_png())
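
If Graphviz is not installed, sklearn.tree.plot_tree (available since scikit-learn 0.21) can draw a tree directly with matplotlib instead. A minimal sketch (max_depth here is illustrative, to keep the drawing readable):

from sklearn.tree import plot_tree

plt.figure(figsize=(20, 10))
# max_depth limits the drawing depth so the plot stays readable.
plot_tree(forest.estimators_[0], feature_names=list(X_train.columns),
          max_depth=3, filled=True)
plt.show()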

9. Model Feature Importance

col = list(X_train.columns.values)
importances = forest.feature_importances_
x_columns = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10_类别1', 'x10_类别2']
print("Importances:", importances)
# Indices that sort the importances from largest to smallest
indices = np.argsort(importances)[::-1]
for i in indices:
    print(f"{col[i]}: {importances[i]:.4f}")
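
A simple bar chart of the ranked importances, sketched with the matplotlib already imported above:

plt.figure()
plt.bar(range(len(indices)), importances[indices])
plt.xticks(range(len(indices)), [col[i] for i in indices], rotation=45)
plt.title("Feature importances")
plt.tight_layout()
plt.show()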

(Figure: feature importance ranking)

Summary

The random forest model's predictions turn out to be very close to the true values, and the model achieves a high score on the validation set.

Source: blog.csdn.net/m0_65157892/article/details/129502566