Python implements an AdaBoost regression model (AdaBoostRegressor) and applies the grid search algorithm for tuning: a practical project

Note: This is a hands-on machine learning project (with data, code, documentation, and a code walkthrough). If you need the data, code, documentation, and code walkthrough, you can go directly to the end of the article to get them.




1. Project Background

The AdaBoost (Adaptive Boosting) algorithm is an effective and practical boosting algorithm that trains weak learners sequentially in a highly adaptive manner. For classification problems, AdaBoost adjusts the sample weights according to the performance of the previous round: samples misclassified by the last weak learner receive a larger weight in the next weak learner, correctly classified samples receive a smaller weight, and a new weak learner is added to the model at each iteration. The weights are adjusted and weak learners are trained repeatedly until the number of misclassifications falls below a preset value or the number of iterations reaches the specified maximum, at which point a strong learner is obtained. Put simply, the core idea of AdaBoost is to increase the weight of misclassified samples and iterate.
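For intuition, here is a minimal NumPy sketch of the classic discrete AdaBoost weight update for binary classification (labels in {-1, +1}); the function and variable names are illustrative and the weak-learner training itself is omitted:

import numpy as np

def adaboost_weight_update(w, y_true, y_pred):
    # One boosting round: compute the learner weight alpha and reweight the samples.
    # w: current sample weights (sum to 1); y_true, y_pred: labels in {-1, +1}.
    miss = (y_true != y_pred)                          # samples misclassified by this weak learner
    eps = np.sum(w[miss])                              # weighted error rate
    alpha = 0.5 * np.log((1 - eps) / (eps + 1e-10))    # learner weight: lower error gives larger alpha
    w = w * np.exp(alpha * np.where(miss, 1.0, -1.0))  # increase weights of mistakes, decrease the rest
    return w / w.sum(), alpha                          # renormalize so the weights sum to 1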

This project uses the AdaBoost regression algorithm to build the model and uses the grid search algorithm to find the optimal parameter values.

2. Data Acquisition

The modeling data comes from the Internet (compiled by the author of this project); the data items are summarized as follows:

The data details are as follows (partial display):

 

3. Data preprocessing

3.1 View data with the Pandas tool

Use the head() method of the Pandas tool to view the first five rows of data:

Key code:
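The code itself appears only as an image in the original; a minimal sketch, assuming the data is read from a local CSV file (the file name data.csv is illustrative):

import pandas as pd

data = pd.read_csv('data.csv')  # load the modeling data; the file name is an assumption
print(data.head())              # view the first five rows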

 

3.2 Check for missing data

Use the info() method of the Pandas tool to view data information:

As can be seen from the figure above, there are 9 variables in total, 1000 records, and no missing values in the data.

Key code:
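A minimal sketch of this check, reusing the data frame loaded above:

data.info()                 # column types and non-null counts: 9 variables, 1000 rows, no missing values
print(data.isnull().sum())  # explicit per-column count of missing values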

3.3 Data descriptive statistics 

Use the describe() method of the Pandas tool to view the mean, standard deviation, minimum, quantile, and maximum of the data.

The key code is as follows:
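A minimal sketch of this step:

print(data.describe())  # count, mean, std, min, 25%/50%/75% quantiles, and max for each numeric column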

 

4. Exploratory Data Analysis

4.1 Histogram of the y variable

Use the hist() method of the Matplotlib tool to draw a histogram:

As can be seen from the figure above, the y variable is mainly concentrated between -300 and 300.

4.2 Correlation analysis

As can be seen from the figure above, the larger the absolute value of a coefficient, the stronger the correlation; a positive value indicates a positive correlation and a negative value indicates a negative correlation.
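The plotting code for the correlation figure is not included in the text; a common way to compute and visualize the correlation matrix (the use of seaborn here is an assumption, not confirmed by the original) is:

import matplotlib.pyplot as plt
import seaborn as sns

corr = data.corr()  # pairwise Pearson correlation coefficients between the variables
plt.figure(figsize=(9, 7))
sns.heatmap(corr, annot=True, fmt='.2f', cmap='coolwarm')  # annotate each cell with its coefficient
plt.title('Correlation matrix of the modeling variables')
plt.show()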

5. Feature Engineering

5.1 Establish feature data and label data

The key code is as follows:
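The code is not reproduced in the text; a minimal sketch, assuming the target column is named y as in the histogram code at the end of the article:

X = data.drop('y', axis=1)  # feature data: all columns except the target
y = data['y']               # label data: the target column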

5.2 Dataset splitting

Use the train_test_split() method to split the data into an 80% training set and a 20% test set. The key code is as follows:
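A minimal sketch of the split (the random_state value is illustrative):

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80% training set, 20% test set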

6. Build the AdaBoost regression model

The AdaBoost regression algorithm and the grid search algorithm are mainly used for the target regression.

6.1 Build the model with default parameters
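The original code block is not reproduced here; a minimal sketch of fitting AdaBoostRegressor with its default parameters might look like this (the random_state value is illustrative):

from sklearn.ensemble import AdaBoostRegressor

model = AdaBoostRegressor(random_state=42)  # default parameters
model.fit(X_train, y_train)                 # train on the 80% training split
y_pred = model.predict(X_test)              # predictions evaluated in section 7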

6.2 Grid search for optimal parameter values
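The exact parameter grid used by the author is not shown; a typical GridSearchCV setup for AdaBoostRegressor, with an illustrative grid, is sketched below:

from sklearn.ensemble import AdaBoostRegressor
from sklearn.model_selection import GridSearchCV

param_grid = {  # illustrative search space, not the author's exact values
    'n_estimators': [50, 100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.5, 1.0],
    'loss': ['linear', 'square', 'exponential'],
}
grid = GridSearchCV(AdaBoostRegressor(random_state=42), param_grid,
                    cv=5, scoring='r2', n_jobs=-1)  # 5-fold cross-validation on R^2
grid.fit(X_train, y_train)
print(grid.best_params_)  # optimal parameter values found by the search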

 

6.3 Build the model with the optimal parameter values
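Continuing the sketch above, the tuned model can be rebuilt from the best parameters found by the search:

best_model = AdaBoostRegressor(random_state=42, **grid.best_params_)  # rebuild with the tuned parameters
best_model.fit(X_train, y_train)
y_pred_best = best_model.predict(X_test)  # predictions used for the evaluation below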

 

7. Model Evaluation

7.1 Evaluation indicators and results 

The evaluation metrics mainly include the explained variance score, mean absolute error, mean squared error, and R-squared value.

As can be seen from the table above, the R-squared after grid search tuning is 0.889, while the R-squared with the default parameters is 0.8552, so the model performance improved after grid search optimization.

The key code is as follows:

7.2 Comparison chart of actual and predicted values

 

From the above figure, it can be seen that the fluctuations of the actual value and the predicted value are basically the same.
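The plotting code is not shown in the text; one common way to draw such a comparison chart, reusing the variable names from the sketches above, is:

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(range(len(y_test)), y_test.values, label='actual')   # actual test-set target values
plt.plot(range(len(y_test)), y_pred_best, label='predicted')  # tuned-model predictions
plt.xlabel('test sample index')
plt.ylabel('y')
plt.legend()
plt.show()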

7.3 Display of model feature importance

The importance of each feature can be seen from the figure above.
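The plotting code for this figure is not shown; a sketch using the model's feature_importances_ attribute, with column names taken from the feature frame X defined above:

import pandas as pd
import matplotlib.pyplot as plt

importances = pd.Series(best_model.feature_importances_, index=X.columns)  # impurity-based importances
importances.sort_values().plot(kind='barh', figsize=(8, 5))  # sorted horizontal bar chart
plt.xlabel('importance')
plt.title('AdaBoost feature importances')
plt.show()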

8. Conclusion and Outlook

To sum up, this article uses the AdaBoost algorithm to build the regression model and the grid search algorithm to find the optimal parameter values, and the results show that the proposed model performs well. The model can be used for everyday product forecasting.

# Histogram of the y variable distribution
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 5))  # set the figure size
plt.rcParams['font.sans-serif'] = 'SimHei'  # enable Chinese character display
plt.rcParams['axes.unicode_minus'] = False  # show the minus sign correctly when saving figures
data_tmp = data['y']  # select the y variable samples
# Draw the histogram; bins='auto' picks the number of bins automatically, color sets the bar fill color
plt.hist(data_tmp, bins='auto', color='g')
plt.xlabel('y')  # set the x-axis label
plt.ylabel('count')  # set the y-axis label
plt.show()



# The materials required for this practical machine learning project are available at:

# Project description:
# Link: https://pan.baidu.com/s/1dW3S1a6KGdUHK90W-lmA4w
# Extraction code: bcbp



from sklearn.metrics import (r2_score, mean_squared_error,
                             explained_variance_score, mean_absolute_error)

print('AdaBoost regression model - default parameters - R^2:', round(r2_score(y_test, y_pred), 4))
print('AdaBoost regression model - default parameters - mean squared error:', round(mean_squared_error(y_test, y_pred), 4))
print('AdaBoost regression model - default parameters - explained variance score:', round(explained_variance_score(y_test, y_pred), 4))
print('AdaBoost regression model - default parameters - mean absolute error:', round(mean_absolute_error(y_test, y_pred), 4))

For more hands-on projects, see the machine learning practical project collection:

List of machine learning practical projects

