Note: This is a hands-on machine learning project (with data, code, documentation, and a code walkthrough). If you need these materials, go directly to the end of the article.
1. Project background
Support vector machines (SVMs) can also be applied to regression problems; this variant is called support vector regression (SVR). SVM is built on VC-dimension theory and the principle of structural risk minimization. It was originally designed to solve binary classification problems (SVM classification) and was later extended to function approximation, i.e., SVR. In general, the kernel trick can map a nonlinearly separable input sample set into a high-dimensional space where the samples become easier to separate. This project uses the SVR algorithm for modeling and prediction.
2. Data acquisition
The modeling data comes from the Internet and was compiled by the author of this project. The data items are summarized as follows:
The data details are as follows (partially displayed):
3. Data preprocessing
Real-world data often contains missing values, noise, and outliers caused by human input errors, all of which are detrimental to training an algorithm model. Data cleaning handles these kinds of dirty data in appropriate ways to obtain standard, clean, and consistent data that can be used for statistics and mining. Data preprocessing typically includes cleaning, reduction, aggregation, transformation, and sampling. The quality of preprocessing largely determines the accuracy and generalization value of the subsequent analysis, mining, and modeling work. The main preprocessing steps used in this project are introduced below:
3.1 Use Pandas tool to view data
Use the head() method of the Pandas tool to view the first five rows of data:
As the figure above shows, there are 14 fields in total. The first column is an index that must be removed before modeling, leaving 13 variables.
Key code:
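The key code appears as a screenshot in the original; a minimal runnable sketch might look like the following (a tiny stand-in DataFrame is used here, since the full data set is not reproduced in the text; the project itself would load it from file):

```python
import pandas as pd

# Stand-in frame; the project would load the full data set, e.g.
# df = pd.read_csv("avocado.csv")  # hypothetical file name
df = pd.DataFrame({
    "Unnamed: 0": range(6),
    "Date": ["2015-12-27", "2015-12-20", "2015-12-13",
             "2015-12-06", "2015-11-29", "2015-11-22"],
    "AveragePrice": [1.33, 1.35, 0.93, 1.08, 1.28, 1.26],
    "type": ["conventional"] * 6,
})

first_rows = df.head()  # first five rows by default
print(first_rows)
```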
3.2 View the shape of the data
Use the shape attribute of the Pandas tool to view the number of data sets and the number of fields:
As the figure above shows, the data set contains 18,249 rows and 14 columns.
Key code:
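A sketch of this step (again on a small stand-in frame rather than the full 18,249-row data set):

```python
import pandas as pd

# Small stand-in frame for illustration
df = pd.DataFrame({
    "AveragePrice": [1.33, 1.35, 0.93],
    "type": ["conventional"] * 3,
})

rows, cols = df.shape  # (number of rows, number of columns)
print(f"{rows} rows, {cols} columns")
```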
3.3 Judging the null value of variables
Use the isnull().sum() method of the Pandas tool to count the null values of variables. The results are as follows:
As you can see from the picture above, there are no null values in the data fields.
The key code is as follows:
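A minimal sketch of the null-value check on a stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({
    "AveragePrice": [1.33, 1.35, 0.93],
    "Total Volume": [64236.62, 54876.98, 118220.22],
})

null_counts = df.isnull().sum()  # count of nulls per column
print(null_counts)
```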
3.4 Descriptive statistical analysis of data
Use the describe() method of the Pandas tool to perform descriptive statistical analysis of the data. The results are as follows:
The figure above shows the mean, standard deviation, minimum, maximum, and quantiles of each data item.
The key code is as follows:
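A sketch of the descriptive-statistics step on stand-in data:

```python
import pandas as pd

df = pd.DataFrame({"AveragePrice": [1.33, 1.35, 0.93, 1.08, 1.28]})

stats = df.describe()  # count, mean, std, min, quartiles, max
print(stats)
```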
3.5 View data summary information
Use the info() method of the Pandas tool to view the data summary information. The result is as follows:
The key code is as follows:
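A sketch of the summary-info step; `info()` prints dtypes, non-null counts, and memory usage (captured into a buffer here so it can be inspected):

```python
import io
import pandas as pd

df = pd.DataFrame({
    "AveragePrice": [1.33, 1.35, 0.93],
    "type": ["conventional"] * 3,
})

buf = io.StringIO()
df.info(buf=buf)          # dtypes, non-null counts, memory usage
info_text = buf.getvalue()
print(info_text)
```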
3.6 Delete data items
Delete the Unnamed: 0 data item. The key code is as follows:
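In pandas this deletion can be done with `drop`; a sketch on a stand-in frame:

```python
import pandas as pd

df = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "AveragePrice": [1.33, 1.35, 0.93],
})

df = df.drop(columns=["Unnamed: 0"])  # drop the redundant index column
```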
3.7 Convert date format and sort
The key code is as follows:
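A sketch of the conversion-and-sort step, assuming the `Date` column holds date strings:

```python
import pandas as pd

df = pd.DataFrame({
    "Date": ["2015-12-27", "2015-01-04", "2015-06-14"],
    "AveragePrice": [1.33, 1.05, 1.21],
})

df["Date"] = pd.to_datetime(df["Date"])             # string -> datetime64
df = df.sort_values("Date").reset_index(drop=True)  # ascending by date
```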
View the sorted data:
4. Exploratory data analysis
4.1 Average Price of Conventional Avocados Over Time
Use the scatter() method of the Matplotlib tool to perform statistical drawings. The graphical display is as follows:
As you can see from the figure above, the average price of conventional Avocados was higher around November 2016 and October 2017.
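A sketch of this plotting step with Matplotlib's `scatter()`, on a few stand-in rows (the off-screen `Agg` backend is used so the figure is saved rather than displayed):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "Date": pd.to_datetime(["2016-11-06", "2017-10-01", "2015-03-08"]),
    "AveragePrice": [1.45, 1.52, 1.01],
    "type": ["conventional"] * 3,
})

mask = df["type"] == "conventional"
fig, ax = plt.subplots(figsize=(10, 4))
ax.scatter(df.loc[mask, "Date"], df.loc[mask, "AveragePrice"], s=8)
ax.set_xlabel("Date")
ax.set_ylabel("AveragePrice")
ax.set_title("Average Price of Conventional Avocados Over Time")
fig.savefig("conventional_price.png")
```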
4.2 Average Price of Organic Avocados Over Time
As the figure above shows, the average price of organic avocados was relatively high around November 2016 and around May 2017. Comparing figures 4.1 and 4.2 also shows that the average price of the organic type is slightly higher than that of the conventional type.
4.3 Plot weekly average prices by month
As you can see from the chart above, the average weekly price was the highest in September and October 2017.
4.4 Check whether the sample set is balanced
As the figure above shows, each category contains 338 records, so the sample set is fairly balanced.
4.5 Price display by region
As you can see from the chart above, the Hartford Springfield area has the highest average price, and the Houston area has the lowest average price.
4.6 Price display by type
As you can see from the picture above, the average price of the organic type reaches 1.65, which is higher than the average price of the conventional type.
4.7 Correlation analysis
View it through the corr() method of the Pandas tool as follows:
As the figure above shows, positive values indicate a positive correlation between data items and negative values a negative correlation; the larger the absolute value, the stronger the correlation.
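A sketch of the correlation step on stand-in numeric columns:

```python
import pandas as pd

df = pd.DataFrame({
    "AveragePrice": [1.33, 1.35, 0.93, 1.08],
    "Total Volume": [64236.62, 54876.98, 118220.22, 78992.15],
    "Small Hass": [1036.74, 674.28, 794.70, 1132.00],
})

corr_matrix = df.corr()  # pairwise Pearson correlation of numeric columns
print(corr_matrix)
```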
4.8 Drawing a pie chart
As can be seen from the above figure, in 2015 SmallHass accounted for the largest share of 40.2%; XLarge Bags accounted for the least share of 0.1%.
As can be seen from the above figure, in 2016 SmallHass accounted for the largest share of 30.9%; XLarge Bags accounted for the least share of 0.4%.
As you can see from the picture above, in 2017 SmallHass accounted for the largest share of 33.0%; XLarge Bags accounted for the least share of 0.5%.
As you can see from the picture above, in 2018 SmallHass accounted for the largest share of 33.1%; XLarge Bags accounted for the least share of 0.5%.
4.9 Check the data correlation after data preprocessing
The figure above shows the correlations between data items after preprocessing: positive values indicate a positive correlation, negative values a negative correlation, and the larger the absolute value, the stronger the correlation.
5. Feature engineering
5.1 Data standardization
Standardize the data items from Small Hass through XLarge Bags. The key code is as follows:
Standardized data, as shown below:
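A sketch of the standardization step using scikit-learn's `StandardScaler` (z-score scaling); the volume column names below are assumed from the avocado data set:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Stand-in frame with assumed volume column names
df = pd.DataFrame({
    "Small Hass": [1036.74, 674.28, 794.70, 1132.00],
    "Large Hass": [54454.85, 44638.81, 109149.67, 71976.41],
    "XLarge Bags": [0.0, 0.0, 0.0, 6.25],
})

cols = ["Small Hass", "Large Hass", "XLarge Bags"]
scaler = StandardScaler()                  # z-score: (x - mean) / std
df[cols] = scaler.fit_transform(df[cols])
```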
5.2 Establish feature data and label data
AveragePrice is label data, and data other than AveragePrice is feature data. The key code is as follows:
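A sketch of the feature/label split described above:

```python
import pandas as pd

df = pd.DataFrame({
    "AveragePrice": [1.33, 1.35, 0.93],
    "Small Hass": [1036.74, 674.28, 794.70],
    "Small Bags": [8603.62, 9408.07, 8042.21],
})

y = df["AveragePrice"]                 # label
X = df.drop(columns=["AveragePrice"])  # everything else is a feature
```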
5.3 Dummy feature processing
Since the type and region data items are categorical (string) variables, they do not meet the requirements of machine learning modeling. They are therefore converted into numeric dummy features. The key code is as follows:
The result after transformation is as shown in the figure below:
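A sketch of the dummy-encoding step with `pd.get_dummies` on stand-in rows:

```python
import pandas as pd

df = pd.DataFrame({
    "AveragePrice": [1.33, 1.35, 0.93],
    "type": ["conventional", "organic", "conventional"],
    "region": ["Albany", "Houston", "Albany"],
})

# One-hot encode the categorical columns into numeric dummy features
df_encoded = pd.get_dummies(df, columns=["type", "region"])
print(df_encoded.columns.tolist())
```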
5.4 Visualizing variables that are highly correlated with the average price variable
As the figure above shows, as Small Hass volume increases, the average price generally trends downward.
As can be seen from the above figure, as the size of Small Bags increases, the average price remains relatively stable overall.
As can be seen from the above figure, as Large Bags increase, the average price is generally relatively stable.
As can be seen from the above figure, the organic type has less impact on the average price than the conventional type.
5.5 Data set splitting
The data set is split into a training set (70%) and a validation set (30%). The key code is as follows:
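A sketch of the split using scikit-learn's `train_test_split` (the `random_state` value is an assumption for reproducibility):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in features and labels
X = pd.DataFrame({"Small Hass": range(10), "Large Hass": range(10, 20)})
y = pd.Series([1.0, 1.1, 0.9, 1.2, 1.3, 1.0, 0.8, 1.4, 1.1, 1.2])

X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)  # 70% train / 30% validation
```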
6. Build SVR regression model
The SVR algorithm is used for the regression task.
6.1 Model parameters
The key code is as follows:
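A sketch of fitting an SVR model; the hyperparameter values shown are scikit-learn's common defaults, not necessarily the ones used in the project, and the training data here is synthetic:

```python
import numpy as np
from sklearn.svm import SVR

# Synthetic stand-in training data
rng = np.random.default_rng(0)
X_train = rng.normal(size=(50, 3))
y_train = X_train @ np.array([0.5, -0.2, 0.1]) + 1.0

model = SVR(kernel="rbf", C=1.0, epsilon=0.1)  # assumed settings
model.fit(X_train, y_train)
preds = model.predict(X_train)
```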
7. Model evaluation
7.1 Evaluation indicators and results
The evaluation indicators mainly include the score, explained variance, mean squared error, and R-squared value.
As the table above shows, the score is 0.82, indicating that the SVR regression model performs well.
The key code is as follows:
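A sketch of computing these metrics with scikit-learn (on made-up true/predicted values for illustration); note that a regressor's `score()` method also returns the R-squared value:

```python
import numpy as np
from sklearn.metrics import (explained_variance_score,
                             mean_squared_error, r2_score)

# Illustrative true and predicted values
y_true = np.array([1.0, 1.2, 0.9, 1.4, 1.1])
y_pred = np.array([1.05, 1.15, 0.95, 1.30, 1.12])

ev = explained_variance_score(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)
print(ev, mse, r2)
```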
7.2 Comparison chart between true value and predicted value
As the figure above shows, the fluctuations of the true and predicted values are basically consistent, so the model fits the data well.
8. Conclusion and outlook
In summary, this article builds an SVR regression model and shows that it performs well; it can be used for modeling and forecasting in real business scenarios.
The materials required to reproduce this machine learning project are as follows:
Project description:
Link: https://pan.baidu.com/s/1dW3S1a6KGdUHK90W-lmA4w
Extraction code: bcbp
If the network disk link fails, you can add the blogger's WeChat account: zy10178083