[Practical project] Implementing a support vector machine (SVM) regression model in Python: the SVR algorithm in practice

Note: This is a hands-on machine learning project that comes with data, code, documentation, and a code walkthrough. To obtain these materials, go directly to the end of the article.

1. Project background

      Support vector machines (SVMs) can be applied to regression problems; this is known as support vector regression (SVR). The SVM is based on VC-dimension theory and the principle of structural risk minimization. It was originally designed to solve binary classification problems (support vector classification) and was later extended to function approximation, i.e., support vector regression (SVR). In general, the kernel trick can be used to map a nonlinearly distributed input sample set into a high-dimensional space, improving the separability of the samples. This project uses the SVR algorithm for modeling and prediction.

2. Data acquisition

The modeling data for this project comes from the Internet (compiled by the author of this project). The data items are summarized as follows:

The data details are as follows (partially displayed):

3. Data preprocessing

       Real-world data may contain a large number of missing values, noise, or outliers caused by human input errors, all of which are detrimental to training an algorithmic model. The goal of data cleaning is to handle each kind of dirty data appropriately so as to obtain standard, clean, consistent data that can be used for statistics and data mining. Data preprocessing usually includes data cleaning, reduction, aggregation, transformation, sampling, and so on. The quality of preprocessing determines the accuracy and generalization value of subsequent analysis, mining, and modeling work. The main preprocessing steps of this project are briefly introduced below:

3.1 Use Pandas tool to view data

Use the head() method of the Pandas tool to view the first five rows of data:

 As can be seen from the figure above, there are 14 columns in total; the first column is an index, which needs to be removed before modeling, leaving 13 variables.

Key code:
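A minimal sketch of loading the data and viewing the first rows. The file name avocado.csv is an assumption (the article does not state it); a tiny stand-in frame with the same kind of layout is used here so the snippet runs on its own.

```python
import pandas as pd

# File name is an assumption; replace with the actual path to the project data.
# df = pd.read_csv("avocado.csv")

# Tiny stand-in frame for illustration only:
df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "Date": ["2015-12-27", "2015-12-20", "2015-12-13"],
    "AveragePrice": [1.33, 1.35, 0.93],
    "type": ["conventional"] * 3,
    "region": ["Albany"] * 3,
})

print(df.head())  # first five rows (here, all three stand-in rows)
```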

3.2 View the shape of the data

Use the shape attribute of the Pandas tool to view the number of data sets and the number of fields:

 

As can be seen from the figure above, the dataset contains 18,249 rows and 14 columns.

Key code:
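A sketch of the shape check, again on a stand-in frame since the real file is not bundled here. On the project data this would report 18,249 rows and 14 columns.

```python
import pandas as pd

# Stand-in frame; with the real file this would follow pd.read_csv(...)
df = pd.DataFrame({"AveragePrice": [1.33, 1.35, 0.93],
                   "Total Volume": [64236.62, 54876.98, 118220.22]})

rows, cols = df.shape  # (number of rows, number of columns)
print(f"{rows} rows, {cols} columns")
```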

3.3 Judging the null value of variables

Use the isnull().sum() method of the Pandas tool to count the null values of each variable. The results are as follows:

 

As can be seen from the figure above, none of the data fields contain null values.

The key code is as follows:
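A sketch of the null-value check; the stand-in frame deliberately contains one missing value to show what the per-column counts look like (the project data itself has none).

```python
import pandas as pd

df = pd.DataFrame({"AveragePrice": [1.33, None, 0.93],
                   "type": ["conventional", "organic", "organic"]})

null_counts = df.isnull().sum()  # number of missing values per column
print(null_counts)
```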

3.4 Descriptive statistical analysis of data

 Use the describe() method of the Pandas tool to perform descriptive statistical analysis of the data. The results are as follows:

The figure above shows the count, mean, standard deviation, minimum, quartiles, and maximum of each data item.

The key code is as follows:
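A sketch of the descriptive statistics step on stand-in values:

```python
import pandas as pd

df = pd.DataFrame({"AveragePrice": [1.33, 1.35, 0.93, 1.08]})

stats = df.describe()  # count, mean, std, min, 25%, 50%, 75%, max per numeric column
print(stats)
```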

3.5 View data summary information

 Use the info() method of the Pandas tool to view the data summary information. The result is as follows:

 The key code is as follows:
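A sketch of the summary-information step. info() prints to stdout by default; capturing it into a buffer is shown here only so the output can be inspected programmatically.

```python
import io
import pandas as pd

df = pd.DataFrame({"AveragePrice": [1.33, 1.35, 0.93],
                   "type": ["conventional", "organic", "organic"]})

buf = io.StringIO()
df.info(buf=buf)  # column names, non-null counts, dtypes, memory usage
summary = buf.getvalue()
print(summary)
```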

3.6 Delete data items

Delete the Unnamed: 0 data item. The key code is as follows:
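A sketch of dropping the redundant index column, on a stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({"Unnamed: 0": [0, 1], "AveragePrice": [1.33, 1.35]})

df = df.drop(columns=["Unnamed: 0"])  # remove the redundant index column
```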

 

3.7 Convert date format and sort

The key code is as follows:
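A sketch of the date conversion and sort, assuming the Date column arrives as strings (the exact source format is not shown in the article):

```python
import pandas as pd

df = pd.DataFrame({"Date": ["2015-12-27", "2015-01-04"],
                   "AveragePrice": [1.33, 1.75]})

df["Date"] = pd.to_datetime(df["Date"])             # string -> datetime64
df = df.sort_values("Date").reset_index(drop=True)  # chronological order
```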

 View the sorted data:

4. Exploratory data analysis

4.1 Average Price of Conventional Avocados Over Time

Use the scatter() method of the Matplotlib tool to perform statistical drawings. The graphical display is as follows:
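A sketch of how such a scatter plot might be produced. The two data points and the output file name are placeholders; with the real data one would plot the Date and AveragePrice columns of the conventional subset.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, renders to file
import matplotlib.pyplot as plt
import pandas as pd

# Placeholder points standing in for the conventional-avocado subset
df = pd.DataFrame({"Date": pd.to_datetime(["2016-11-06", "2017-10-01"]),
                   "AveragePrice": [1.52, 1.68]})

fig, ax = plt.subplots(figsize=(10, 4))
ax.scatter(df["Date"], df["AveragePrice"], s=8)
ax.set_xlabel("Date")
ax.set_ylabel("AveragePrice")
ax.set_title("Average Price of Conventional Avocados Over Time")
fig.savefig("conventional_price.png")
```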

 As can be seen from the figure above, the average price of conventional avocados was higher around November 2016 and October 2017.

4.2 Average Price of Organic Avocados Over Time

 As can be seen from the figure above, the average price of organic avocados was relatively high around November 2016 and around May 2017. Comparing figures 4.1 and 4.2, the average price of the organic type is slightly higher than that of the conventional type.

4.3 Plot weekly average prices by month

 As you can see from the chart above, the average weekly price was the highest in September and October 2017.

4.4 Check whether the sample set is balanced

 

 As can be seen from the figure above, each category contains 338 records, so the sample set is relatively balanced.

4.5 Price display by region

 As you can see from the chart above, the Hartford Springfield area has the highest average price, and the Houston area has the lowest average price.

4.6 Price display by type

As can be seen from the figure above, the average price of the organic type reaches 1.65, higher than the average price of the conventional type.

4.7 Correlation analysis

View it through the corr() method of the Pandas tool as follows:
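A sketch of the correlation step on two stand-in numeric columns; with the real data, corr() is computed over all numeric data items.

```python
import pandas as pd

df = pd.DataFrame({"AveragePrice": [1.33, 1.35, 0.93, 1.08],
                   "Total Volume": [64236.62, 54876.98, 118220.22, 78992.15]})

corr = df.corr()  # pairwise Pearson correlation matrix
print(corr)
```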

 As can be seen from the figure above, positive values indicate positive correlation between data items and negative values indicate negative correlation; the larger the absolute value, the stronger the correlation.

4.8 Drawing a pie chart

 As can be seen from the figure above, in 2015 Small Hass accounted for the largest share at 40.2%, and XLarge Bags the smallest at 0.1%.

As can be seen from the figure above, in 2016 Small Hass accounted for the largest share at 30.9%, and XLarge Bags the smallest at 0.4%.

 As can be seen from the figure above, in 2017 Small Hass accounted for the largest share at 33.0%, and XLarge Bags the smallest at 0.5%.

 As can be seen from the figure above, in 2018 Small Hass accounted for the largest share at 33.1%, and XLarge Bags the smallest at 0.5%.

4.9 Check the data correlation after data preprocessing

 The figure above shows the correlation values of the data items after preprocessing: positive values indicate positive correlation, negative values indicate negative correlation, and the larger the absolute value, the stronger the correlation.

5. Feature engineering

5.1 Data standardization

Standardize the data items from Small Hass through XLarge Bags. The key code is as follows:
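A sketch of the standardization step with scikit-learn's StandardScaler (the article does not show which scaler it uses, so this is an assumption). The stand-in frame uses two of the volume columns; in the project, every column from Small Hass through XLarge Bags would be listed.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({"Small Hass": [1036.74, 674.28, 794.70],
                   "Small Bags": [8603.62, 9408.07, 8042.21],
                   "AveragePrice": [1.33, 1.35, 0.93]})

# In the project: all columns from Small Hass through XLarge Bags
cols = ["Small Hass", "Small Bags"]
df[cols] = StandardScaler().fit_transform(df[cols])  # zero mean, unit variance
```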

 Standardized data, as shown below:

5.2 Establish feature data and label data

AveragePrice is label data, and data other than AveragePrice is feature data. The key code is as follows:
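A sketch of separating the label from the features, on a stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({"AveragePrice": [1.33, 1.35],
                   "Small Hass": [1036.74, 674.28],
                   "Small Bags": [8603.62, 9408.07]})

y = df["AveragePrice"]                  # label data
X = df.drop(columns=["AveragePrice"])   # feature data: everything else
```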

 

5.3 Dumb feature processing

Since the type and region data items are categorical (string-typed) variables, they do not meet the requirements of machine learning modeling; type and region are therefore converted into numeric dummy (one-hot) features. The key code is as follows:
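A sketch of the dummy-feature conversion with pandas get_dummies, on stand-in rows:

```python
import pandas as pd

df = pd.DataFrame({"type": ["conventional", "organic"],
                   "region": ["Albany", "Houston"],
                   "AveragePrice": [1.33, 1.65]})

# One indicator column per category value, e.g. type_organic, region_Houston
df = pd.get_dummies(df, columns=["type", "region"])
print(df.columns.tolist())
```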

 The result after transformation is as shown in the figure below:

5.4 Visualizing variables that are highly correlated with the average price variable

 As can be seen from the figure above, as Small Hass increases, the average price generally shows a downward trend.

As can be seen from the figure above, as Small Bags increase, the average price remains relatively stable overall.

 As can be seen from the figure above, as Large Bags increase, the average price is likewise relatively stable.

As can be seen from the figure above, the organic type has less impact on the average price than the conventional type.

5.5 Data set splitting

The data is split into a training set and a validation set: 70% for training and 30% for validation. The key code is as follows:
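A sketch of the 70/30 split with scikit-learn, on stand-in arrays (random_state is an assumption; the article does not state a seed):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)  # stand-in feature matrix, 10 samples
y = np.arange(10)                 # stand-in labels

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)  # 70% train, 30% validation
```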

 

6. Build SVR regression model

The SVR algorithm is used for the target regression.

6.1 Model parameters

 

The key code is as follows:
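A sketch of fitting an SVR model. The article's actual hyperparameters are not shown, so scikit-learn's defaults (RBF kernel, C=1.0, epsilon=0.1) are assumed; synthetic data stands in for the avocado features.

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic stand-in data: a noisy linear target over 3 features
rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.randn(100)

# Hyperparameters assumed (scikit-learn defaults); tune for the real data
model = SVR(kernel="rbf", C=1.0, epsilon=0.1)
model.fit(X, y)
```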

7. Model evaluation

7.1 Evaluation indicators and results

The evaluation indicators mainly include the score, explained variance, mean squared error, and R-squared value.

 

As can be seen from the table above, the score is 0.82, indicating that the SVR regression model performs well.

The key code is as follows:
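A sketch of computing the listed metrics on a validation set. Synthetic data and default SVR hyperparameters are assumed, as in the modeling step; on the project data the score reported above is 0.82.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score

# Synthetic stand-in data
rng = np.random.RandomState(0)
X = rng.rand(200, 3)
y = X @ np.array([1.0, -2.0, 0.5]) + 0.05 * rng.randn(200)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=42)

model = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X_train, y_train)
y_pred = model.predict(X_val)

score = model.score(X_val, y_val)              # R-squared on the validation set
evs = explained_variance_score(y_val, y_pred)  # explained variance
mse = mean_squared_error(y_val, y_pred)        # mean squared error
r2 = r2_score(y_val, y_pred)                   # R-squared (same as score here)
print(score, evs, mse, r2)
```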

7.2 Comparison chart between true value and predicted value 

 It can be seen from the figure above that the fluctuations of the true values and the predicted values are basically consistent, and the model fits well.

8. Conclusion and outlook

In summary, this article builds an SVR regression model and ultimately shows that the proposed model works well. It can be used for modeling and forecasting in real business scenarios.

The materials required for the practical implementation of this machine learning project are as follows:

Project description:
Link: https://pan.baidu.com/s/1dW3S1a6KGdUHK90W-lmA4w 
Extraction code: bcbp

If the network disk fails, you can add the blogger's WeChat account: zy10178083

Origin blog.csdn.net/weixin_42163563/article/details/121963661