[Practical project] Implementing an XGBoost regression model (XGBRegressor) in Python

Explanation: This is a hands-on machine learning project (data + code + documentation + code explanation included). If you need the data, code, documentation, and code explanation, you can go directly to the end of the article to get them.

1. Project background

With the advent of the big data era, big data thinking has become essential, and applications of artificial intelligence can be seen in every industry. In manufacturing, AI can greatly improve production efficiency, reduce labor costs, and improve product quality; in the service industry, it can optimize existing products and services and raise their quality and labor productivity; in finance, healthcare, and other fields, AI has likewise flourished and made people's lives more convenient.

As a necessity for every citizen, housing plays a very important role in daily life, and buying a house has become a common topic of conversation; how to buy and sell at the right time is a focus of attention. Against this background, the problem of house price prediction arises. There are currently two main questions in this field: one is to choose an appropriate mathematical model to predict the direction of house prices and evaluate how they change; the other is to find the causes of price changes, so that the state can use them to help coordinate the market and citizens can judge the timing of a purchase based on current conditions. This project addresses the first question: choosing an appropriate mathematical model to help predict house prices.

This project starts from house price data for a certain area, uses the relevant attributes of the houses in that area as features, screens out the important information, processes some of it appropriately, and finally uses the result to predict the prices of other houses in the area.

2. Data acquisition

The modeling data comes from the Internet (compiled by the author of this project). The data items are summarized as follows:

The data details are as follows:
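The raw table is shown as a screenshot in the original post. As a minimal sketch, assuming the data is saved as a CSV file named `train.csv` (the file name and path are assumptions), it can be loaded and inspected like this:

```python
import pandas as pd

# Load the house price data; the file name/path is an assumption.
train = pd.read_csv("train.csv")

# Overview of the data items: shape, column types, and the first few rows.
print(train.shape)
print(train.dtypes)
print(train.head())
```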

3. Exploratory data analysis

3.1 Descriptive statistics of house prices

Key code:
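The code itself appears as a screenshot in the original post; a sketch of this step, assuming the DataFrame `train` loaded above, might look like:

```python
# Descriptive statistics of the target variable:
# count, mean, standard deviation, min, quartiles, and max.
print(train["SalePrice"].describe())
```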

As can be seen from the output above, SalePrice has 1460 observations, a mean of 180921.1959, and a standard deviation of 79442.5029; the minimum, maximum, and quartile values are also reported.

3.2 House price histogram

Key code:
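A possible implementation of the histogram, assuming seaborn and the `train` DataFrame from the previous steps:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Histogram of SalePrice with a kernel density curve overlaid.
sns.histplot(train["SalePrice"], kde=True)
plt.xlabel("SalePrice")
plt.ylabel("Count")
plt.title("Distribution of house prices")
plt.show()
```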

From the figure above, we can see that housing prices follow a roughly normal distribution, with a longer tail on the right (this skew is quantified in the next section).

3.3 Housing price skewness and kurtosis

 Key code:
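A sketch of this step, using the pandas skewness/kurtosis helpers on the `train` DataFrame:

```python
# Skewness and kurtosis of the target variable.
print("Skewness: %.4f" % train["SalePrice"].skew())
print("Kurtosis: %.4f" % train["SalePrice"].kurt())
```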

 The results are shown in the table below:

As can be seen from the table above, a kurtosis greater than 0 means the distribution has a steeper, sharper peak than a normal distribution, while a skewness greater than 0 means the distribution is positively (right) skewed, with a long tail trailing to the right. Combined with the house price histogram, we can confirm that the tail is indeed on the right and the peak is steep.

3.4 OverallQual (overall evaluation) box plot

Key code:
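One way to draw the box plot, assuming seaborn and the `train` DataFrame:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Box plot of SalePrice grouped by the overall quality rating.
plt.figure(figsize=(10, 6))
sns.boxplot(x="OverallQual", y="SalePrice", data=train)
plt.title("SalePrice by OverallQual")
plt.show()
```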

As can be seen from the figure above, the higher the overall rating, the higher the price of the house.

3.5 YearBuilt (year built) scatter plot

Key code:
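A sketch of the scatter plot, assuming matplotlib and the `train` DataFrame:

```python
import matplotlib.pyplot as plt

# Scatter plot of SalePrice against the year the house was built.
plt.figure(figsize=(10, 6))
plt.scatter(train["YearBuilt"], train["SalePrice"], s=8, alpha=0.5)
plt.xlabel("YearBuilt")
plt.ylabel("SalePrice")
plt.title("SalePrice vs. YearBuilt")
plt.show()
```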

As can be seen from the chart above, house prices increase roughly linearly with the year of construction.

3.6 Correlation analysis

As the correlation heat map shows, the correlations between house price and these features are all positive and relatively strong.

Key code:
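A sketch of the correlation heat map, restricted to the seven modelling features plus the target (the exact column list used in the original screenshot may differ):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Correlation matrix of the selected features and the target, drawn as a heat map.
cols = ["SalePrice", "OverallQual", "GrLivArea", "GarageCars",
        "TotalBsmtSF", "FullBath", "TotRmsAbvGrd", "YearBuilt"]
plt.figure(figsize=(8, 6))
sns.heatmap(train[cols].corr(), annot=True, fmt=".2f", cmap="coolwarm")
plt.title("Correlation heat map")
plt.show()
```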

4. Data preprocessing

Real-world data may contain a large number of missing values, noisy records, or outliers caused by human input errors, all of which hinder the training of the model. The goal of data cleaning is to handle each kind of dirty data appropriately so as to obtain standard, clean, consistent data that can be used for statistics and data mining. Data preprocessing usually includes cleaning, reduction, aggregation, transformation, sampling, and so on, and its quality determines the accuracy and generalization value of the subsequent analysis, mining, and modeling work. The main preprocessing steps used in this project are introduced below:

4.1 Data Standardization

The standardized data is as follows:

 

Key code:
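The original post does not state which scaling method it uses; as one possible sketch, z-score standardisation with scikit-learn's `StandardScaler` could be applied to the seven modelling columns:

```python
from sklearn.preprocessing import StandardScaler

# Columns used as model inputs (taken from the feature list in section 6).
feature_cols = ["GrLivArea", "TotRmsAbvGrd", "FullBath", "TotalBsmtSF",
                "GarageCars", "YearBuilt", "OverallQual"]

# Z-score standardisation: rescale each feature to zero mean and unit variance.
scaler = StandardScaler()
train[feature_cols] = scaler.fit_transform(train[feature_cols])
print(train[feature_cols].head())
```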

4.2 Processing of missing values in the test set data

Key code:

Checking the test set shows that the GarageCars and TotalBsmtSF columns contain missing values; here they are filled with the column means:
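A sketch of the check and the mean imputation, assuming the test data has been loaded into a DataFrame named `test` (for example from a `test.csv` file, which is an assumption) and `feature_cols` is the column list defined in the standardisation step:

```python
# Count missing values in the modelling columns of the test set.
print(test[feature_cols].isnull().sum())

# Fill the missing GarageCars and TotalBsmtSF values with the column means.
for col in ["GarageCars", "TotalBsmtSF"]:
    test[col] = test[col].fillna(test[col].mean())
```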

5. Feature engineering

5.1 Establish feature data and label data

SalePrice is the label, and everything other than SalePrice serves as feature data. The key code is as follows:
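A sketch of this step; it keeps only the seven columns actually used for modelling in section 6 (`feature_cols` from the standardisation step):

```python
# SalePrice is the label; the selected attribute columns are the features.
X = train[feature_cols]
y = train["SalePrice"]
```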

 

5.2 Data set splitting

The data is split into a training set and a validation set: 80% for training and 20% for validation. The key code is as follows:
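A sketch using scikit-learn's `train_test_split` (the random seed is an arbitrary choice):

```python
from sklearn.model_selection import train_test_split

# 80% of the data for training, 20% held out for validation.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=42)
```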

6. Build the XGBoost regression model

The house price (SalePrice) is predicted from seven variables in the data: living area (GrLivArea), total number of rooms (TotRmsAbvGrd), number of bathrooms (FullBath), total basement area (TotalBsmtSF), garage (GarageCars), year built (YearBuilt), and overall evaluation (OverallQual). The XGBRegressor algorithm is used for the regression.

 6.1 Model parameters

 The key code is as follows:
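The exact hyperparameters are only visible in the original screenshot; the values below are illustrative assumptions, not the author's settings:

```python
from xgboost import XGBRegressor

# Build the XGBoost regressor; these hyperparameter values are placeholders.
model = XGBRegressor(
    n_estimators=500,      # number of boosting rounds
    learning_rate=0.05,    # shrinkage applied to each tree
    max_depth=4,           # maximum depth of each tree
    subsample=0.8,         # row sampling ratio per tree
    colsample_bytree=0.8,  # column sampling ratio per tree
    random_state=42,
)

# Fit on the training split and predict on the validation split.
model.fit(X_train, y_train)
y_pred = model.predict(X_valid)
```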

7. Model evaluation

7.1 Evaluation indicators and results

The evaluation indicators mainly include explained variance, mean absolute error, mean squared error, and R-squared.

As can be seen from the table above, the model fits well: the R-squared is 0.9, which is close to 1.

The key code is as follows:
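A sketch of the metric computation with scikit-learn, assuming `y_valid` and `y_pred` from the previous steps:

```python
from sklearn.metrics import (explained_variance_score, mean_absolute_error,
                             mean_squared_error, r2_score)

# Evaluate the validation-set predictions with the four indicators listed above.
print("Explained variance:", explained_variance_score(y_valid, y_pred))
print("Mean absolute error:", mean_absolute_error(y_valid, y_pred))
print("Mean squared error:", mean_squared_error(y_valid, y_pred))
print("R-squared:", r2_score(y_valid, y_pred))
```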

 

7.2 Comparison chart between actual value and predicted value
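The comparison chart appears as an image in the original post; one way such a plot could be drawn, assuming `y_valid` and `y_pred` from section 7.1:

```python
import matplotlib.pyplot as plt

# Overlay the true and predicted prices for the validation samples.
plt.figure(figsize=(10, 6))
plt.plot(range(len(y_valid)), y_valid.values, label="actual", marker="o", markersize=3)
plt.plot(range(len(y_pred)), y_pred, label="predicted", marker="x", markersize=3)
plt.xlabel("Validation sample index")
plt.ylabel("SalePrice")
plt.legend()
plt.title("Actual vs. predicted house prices")
plt.show()
```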

 

As can be seen from the figure, the predicted values track the true values closely, indicating that the model fits well.

7.3 Model feature importance
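The importance chart is likewise an image in the original post; a sketch using xgboost's built-in `plot_importance` helper on the fitted model:

```python
import matplotlib.pyplot as plt
from xgboost import plot_importance

# Plot the feature importance scores learned by the booster.
plot_importance(model)
plt.title("XGBoost feature importance")
plt.show()
```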

 

As can be seen from the resulting figure, the features ranked by importance are GrLivArea, FullBath, TotRmsAbvGrd, and so on.

8. Conclusion and suggestions

To sum up, this article uses the XGBoost model and shows that the proposed model performs well. By building this house price prediction model, we can identify not only the heavyweight feature of total living area, which strongly affects the total value of a house, but also other influential features such as the total number of rooms and the number of bathrooms. Data helps us screen out the important features, discard some "taken for granted" assumptions, and better grasp the essence of the problem. This is a practical example of big data helping us solve problems in everyday life.

The prediction result data is as follows:

The materials required for the actual implementation of this machine learning project are available at the following link:

Link: https://pan.baidu.com/s/1dW3S1a6KGdUHK90W-lmA4w 
Extraction code: bcbp

If the network disk fails, you can add the blogger's WeChat account: zy10178083
