[Practical Machine Learning Project] Python implementation of random forest regression (RandomForestRegressor) model

Note: This is a practical machine learning project with data and code. If you need the data and the complete code, go directly to the end of the article.

 

 1. Define the problem

In e-commerce, sales are increasingly forecast from historical purchasing data, order data, and so on; this project likewise builds a model on historical e-commerce data to forecast sales.

2. Get data

The data is simulated and split into two parts:

Training data set: data_train.xlsx

Test data set: data_test.xlsx

In a real application, simply replace these files with your own data.

Feature data: x1, x2, x3, x4, x5, x6, x7, x8, x9, x10

Label data: y
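The two workbooks can be loaded with pandas; a minimal sketch, assuming both files sit in the working directory:

import pandas as pd

# Load the simulated training and test sets
data_train = pd.read_excel('data_train.xlsx')
data_test = pd.read_excel('data_test.xlsx')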

3. Data preprocessing 

1) Descriptive analysis of data:

data_train.describe()

2) Check data completeness and data types:

data_train.info()

3) Count of missing values per column:
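A one-line sketch using the standard pandas API:

# Count of missing values in each column
data_train.isnull().sum()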

4) Proportion of missing values per column:
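For example, dividing the counts by the number of rows gives the missing-value ratio per column:

# Share of missing values in each column
data_train.isnull().sum() / len(data_train)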


 

 

5) Fill missing values: based on business analysis, filling with 0 is the most appropriate choice:
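A minimal sketch of the fill step, assuming every missing cell should indeed become 0:

# Replace all missing values with 0
data_train = data_train.fillna(0)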

 

 

6) Dummy variable processing

# Recode the text categories of x10 as numbers (类别1 -> 1, 类别2 -> 2)
data_train.loc[data_train['x10'] == '类别1', 'x10'] = 1
data_train.loc[data_train['x10'] == '类别2', 'x10'] = 2
# One-hot encode x10 and append the resulting dummy columns
a = pd.get_dummies(data_train['x10'], prefix="x10")
frames = [data_train, a]
data_train = pd.concat(frames, axis=1)

The feature x10 is text-valued (类别1, 类别2, i.e. category 1 and category 2), which the model cannot use directly, so it is converted into dummy (one-hot) features taking values 0 and 1.

After processing, the data is as follows:

 4. Exploratory data analysis

1) Analysis of the target variable y (sales):

print(data_train['y'].describe())

 

 

 2) Scatter plot of the relationship between feature variable x1 and label variable y:

var = 'x1'
data = pd.concat([data_train['y'], data_train[var]], axis=1)
data.plot.scatter(x=var, y='y')

 3) Scatter plot of the relationship between feature variable x5 and label variable y:

var0 = 'x5'
data0 = pd.concat([data_train['y'], data_train[var0]], axis=1)
data0.plot.scatter(x=var0, y='y')

 

4) Correlation analysis
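No code is given for this step in the original; a minimal sketch using pandas and seaborn (the use of seaborn is an assumption):

import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the numeric columns, shown as a heatmap
corrmat = data_train.corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(corrmat, annot=True, cmap='coolwarm', square=True)
plt.show()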

 

 5. Feature engineering 

1) Split features from the label: y is the label, and every other column is a feature;

2) Split the training data into a training set and a validation set, 80% for training and 20% for validation (see the sketch below);

Feature engineering can include much more, such as standardization and dimensionality reduction; whether those steps are needed depends on the data, and they are not required for this model.
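A minimal sketch of both splits (the 80/20 ratio comes from the text; the fixed random_state is an assumption):

from sklearn.model_selection import train_test_split

# y is the label; every other column is a feature
Y = data_train['y']
X = data_train.drop(columns=['y'])

# 80% training, 20% validation
X_train, X_validation, Y_train, Y_validation = train_test_split(
    X, Y, test_size=0.2, random_state=1)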

6. Machine learning modeling

1) Build a random forest regression model with the following parameters:

Serial number | Parameter
1 | n_estimators=100
2 | random_state=1
3 | n_jobs=-1

Other parameters can be tuned to the specific data.

from sklearn.ensemble import RandomForestRegressor

# 100 trees, fixed random seed, use all CPU cores
forest = RandomForestRegressor(
    n_estimators=100,
    random_state=1,
    n_jobs=-1)
forest.fit(X_train, Y_train)

2) Output and compare the validation-set results: the true and predicted values are written to an Excel file and also plotted as a line chart.

import numpy as np
import matplotlib.pyplot as plt

# Plot the first 1000 true vs. predicted validation values
y_validation_pred = forest.predict(X_validation)
plt.figure()
plt.plot(np.arange(1000), Y_validation[:1000], "go-", label="True value")
plt.plot(np.arange(1000), y_validation_pred[:1000], "ro-", label="Predict value")
plt.title("True value And Predict value")
plt.legend()
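The Excel comparison table mentioned above can be produced with pandas; a sketch (the output file name is an assumption):

# Side-by-side table of true and predicted validation values
result = pd.DataFrame({'y_true': Y_validation.values, 'y_pred': y_validation_pred})
result.to_excel('validation_results.xlsx', index=False)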

3) Generate a decision tree

from sklearn.tree import export_graphviz
# Export the first tree of the forest in Graphviz .dot format
with open('./wine.dot', 'w', encoding='utf-8') as f:
    export_graphviz(forest.estimators_[0], out_file=f)

Because the forest contains many large trees, rendering them all to images at once produces unreadable pictures, so the output is kept in .dot format; you can convert the .dot file to an image as needed.
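For example, the exported file can be rendered to an image like this (a sketch that assumes the Graphviz dot executable is installed and on the PATH):

import subprocess

# Convert the .dot file to a PNG with Graphviz
subprocess.run(['dot', '-Tpng', 'wine.dot', '-o', 'wine.png'], check=True)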

The rendered tree is not shown here; in full it runs to more than 200 pages.

7. Model evaluation

1) The main evaluation metrics are the model score (R², returned by forest.score for a regressor), MAE, MSE, and RMSE.

from sklearn import metrics

# R² of the model on the validation set, plus standard regression error metrics
score = forest.score(X_validation, Y_validation)
print('Random forest model score (R²):', score)
print('Mean Absolute Error:', metrics.mean_absolute_error(Y_validation, y_validation_pred))
print('Mean Squared Error:', metrics.mean_squared_error(Y_validation, y_validation_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(Y_validation, y_validation_pred)))

Serial number | Evaluation metric | Value
1 | Model score (R²) | 0.9769
2 | MAE | 9.9431
3 | MSE | 2625.5679
4 | RMSE | 51.2402

As can be seen from the above table, this random forest model works well.

2) Model feature importances: they are written to an Excel file and also plotted as a bar chart.

col = list(X_train.columns.values)
importances = forest.feature_importances_
# Feature column names after one-hot encoding of x10
x_columns = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x9', 'x10_类别1', 'x10_类别2']
# Indices that sort the importances from largest to smallest
indices = np.argsort(importances)[::-1]
list01 = []
list02 = []
for f in range(X_train.shape[1]):
    # Print rank, feature name and importance in descending order
    print("%2d) %-*s %f" % (f + 1, 30, col[indices[f]], importances[indices[f]]))
    list01.append(col[indices[f]])
    list02.append(importances[indices[f]])

# Save the ranked importances to Excel
data_impts = pd.DataFrame({"columns": list01, "importances": list02})
data_impts.to_excel('data_importances.xlsx')

importances = list(forest.feature_importances_)
feature_list = list(X_train.columns)

# (feature, importance) pairs sorted from most to least important
feature_importances = [(feature, round(importance, 2)) for feature, importance in zip(feature_list, importances)]
feature_importances = sorted(feature_importances, key=lambda x: x[1], reverse=True)
print(feature_importances)

# Bar chart of the importances in original column order
x_values = list(range(len(importances)))
plt.bar(x_values, importances, orientation='vertical')
plt.xticks(x_values, feature_list, rotation=6)
plt.ylabel('Importance')
plt.xlabel('Variable')
plt.title('Variable Importances')
plt.show()

8. Practical application

Predict sales from the feature data of the most recent week (the unlabeled data prepared in advance, data_test.xlsx). The prediction results are as follows:
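A minimal sketch of the prediction step, assuming data_test has been given the same missing-value and dummy-variable treatment as the training data and contains only the feature columns (the output file name is an assumption):

# Apply the trained model to the prepared, unlabeled test features
y_test_pred = forest.predict(data_test)
pd.DataFrame({'y_pred': y_test_pred}).to_excel('prediction_results.xlsx', index=False)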

 

Stocking decisions can then be made based on the forecast sales.

The data and complete code required to implement this machine learning project are available here: https://download.csdn.net/download/weixin_42163563/21093418


Origin: blog.csdn.net/weixin_42163563/article/details/119715312