1. Project overview
- Data source: collected by the author from the Internet and shared via Baidu cloud disk. Link: https://pan.baidu.com/s/1Lmhl34BumjBloN-rhy7Yqw Extraction code: z84d
- Project background: Local fiscal revenue is the sum of all funds raised by the government to perform its functions, implement public policies, and provide public goods and services. It is an important part of national fiscal revenue, yet it also has its own relatively independent composition. How to plan local fiscal expenditure, distribute local fiscal revenue rationally, promote local development, and raise citizens' income and quality of life are primary questions for every local government, which makes forecasting local fiscal revenue necessary. This case applies data mining techniques to a city's fiscal data from 1994 to 2013, the period after China's fiscal system reform, and forecasts fiscal revenue for the following two years. The aim is to help the government manage revenue and expenditure sensibly, optimize the structure of fiscal revenue, and provide a basis for related decisions.
- The design goals are as follows: (1) analyze and identify the key attributes that affect local fiscal revenue; (2) predict the changes in fiscal revenue for 2014 and 2015.
2. Specific application of the project
- Import the necessary packages:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
# Jupyter magic: only valid inside a notebook
%matplotlib inline
import seaborn as sns
from scipy import stats
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score, cross_val_predict, KFold
from sklearn.metrics import make_scorer, mean_squared_error
from sklearn.linear_model import LinearRegression, LassoCV, Ridge, ElasticNet
from sklearn.svm import LinearSVR, SVR
from sklearn.neighbors import KNeighborsRegressor
# r2_score is used to evaluate regression model performance
from sklearn.metrics import r2_score
# Use a font that can render Chinese characters in plots
plt.rcParams['font.sans-serif'] = ['SimHei']
# Render the minus sign correctly
plt.rcParams['axes.unicode_minus'] = False
Read the data:
# Read the data
data = pd.read_csv('data(1).csv')
# Inspect the first 5 rows to get an overview of each column (feature)
data.head()
Feature overview (column types and non-null counts):
data.info()
Missing value analysis:
data.isnull().sum()  # number of missing values in each column
Duplicate value analysis:
data.duplicated()  # flags each row that repeats an earlier row
Descriptive statistical analysis:
data.describe()
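Plot a histogram with a kernel density estimate for each column:
for column in data.columns:
    fig, ax = plt.subplots(figsize=(6, 6))
    # sns.distplot is deprecated; histplot with kde=True draws a comparable histogram plus density curve
    sns.histplot(data.loc[:, column], stat='density', kde=True, bins=20, ax=ax)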
Correlation Analysis :
Correlation analysis measures how strongly two or more features are related. In statistics, the Pearson correlation coefficient is the most commonly used measure: it quantifies the strength of the linear relationship between two features and takes values in [-1, 1]. Values near 1 or -1 indicate a strong positive or negative linear relationship, and values near 0 indicate little or no linear relationship.
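As a rough illustration of the definition, the sketch below computes the Pearson coefficient by hand (covariance divided by the product of the standard deviations) and compares it with the pandas result. The `pearson_r` helper and the `toy` table are made up for this example and are not part of the fiscal dataset.

```python
import numpy as np
import pandas as pd


def pearson_r(a, b):
    """Pearson correlation: centered cross-product over the product of centered norms."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    return float((a_c * b_c).sum() / np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum()))


# Toy data, invented for illustration only
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y_up": [2.1, 3.9, 6.2, 8.1, 9.8],      # roughly 2 * x
    "y_down": [10.0, 8.0, 6.0, 4.0, 2.0],   # exactly -2 * x + 12
})

r_up = pearson_r(toy["x"], toy["y_up"])        # close to +1
r_down = pearson_r(toy["x"], toy["y_down"])    # perfectly negative linear relation
r_pandas = toy["x"].corr(toy["y_up"], method="pearson")

print("hand-computed:", r_up, r_down)
print("pandas       :", r_pandas)
```

The hand-computed value and the pandas value should agree up to floating-point error, which is a quick way to confirm what `data.corr(method='pearson')` reports for each pair of columns.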
corr = data.corr(method='pearson')
corr
result:
The results show that every variable except x11 is strongly correlated with y, and that there is multicollinearity among these attributes. This suggests using the Lasso model for feature selection. A correlation heat map displays the correlations visually.
- Draw a heat map:
# Draw the heat map
plt.style.use('ggplot')
sns.set_style('whitegrid')
plt.subplots(figsize=(10, 10))
sns.heatmap(data.corr(method='pearson'),
            cmap='Reds',
            annot=True,   # write the coefficient in each cell
            square=True,
            fmt='.2f',
            yticklabels=corr.columns,
            xticklabels=corr.columns)
result:
3. Data preprocessing
- Extract key attributes using the Lasso feature selection model
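import pandas as pd
import numpy as np
from sklearn.linear_model import Lasso

data = pd.read_csv('data(1).csv', header=0)
# Features are all columns except the last; the target y is the last column
x, y = data.iloc[:, :-1], data.iloc[:, -1]
# alpha sets the strength of the L1 penalty; a larger alpha zeroes out more coefficients
lasso = Lasso(alpha=1000, random_state=1)
lasso.fit(x, y)
# These are Lasso regression coefficients, not correlation coefficients
print('Lasso coefficients:', np.round(lasso.coef_, 5))
coef = pd.DataFrame(lasso.coef_, index=x.columns)
print('Coefficient table:\n', coef)
# Keep only the features whose coefficient is non-zero
mask = lasso.coef_ != 0.0
x = x.loc[:, mask]
# Selected features plus the target column
new_reg_data = pd.concat([x, y], axis=1)
new_reg_data.to_csv('new_reg_data.csv')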
result
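To make the selection mechanism concrete, here is a minimal sketch on synthetic data, assuming standardized features and a target that depends on only two of five columns. The data, the two `alpha` values, and the `selected` dictionary are invented for illustration; the article itself fits on the raw fiscal columns, which is presumably why it needs a much larger `alpha`.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
# Only the first two columns drive the target; the other three are pure noise
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=n)

selected = {}
for alpha in (0.01, 0.5):
    model = Lasso(alpha=alpha).fit(X, y)
    # Indices of the features that survive the L1 penalty
    selected[alpha] = np.flatnonzero(model.coef_ != 0.0)
    print(f"alpha={alpha}: coef={np.round(model.coef_, 3)}, kept={selected[alpha]}")
```

With the stronger penalty the noise columns are driven to exactly zero and the surviving coefficients are shrunk toward zero, which is the behavior the `mask = lasso.coef_ != 0.0` step relies on.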
Grey forecasting model GM(1,1)
# Custom grey forecasting function GM(1,1)
def GM11(x0):
    # x0 is a 1-D NumPy array holding the original series
    x1 = x0.cumsum()  # accumulated series
    z1 = (x1[:len(x1) - 1] + x1[1:]) / 2.0  # means of adjacent accumulated values
    z1 = z1.reshape((len(z1), 1))
    B = np.append(-z1, np.ones_like(z1), axis=1)
    Yn = x0[1:].reshape((len(x0) - 1, 1))
    # Least-squares estimate of the parameters a and b
    [[a], [b]] = np.dot(np.dot(np.linalg.inv(np.dot(B.T, B)), B.T), Yn)
    # Restored prediction for time step k (k = 1 is the first observation)
    f = lambda k: (x0[0] - b / a) * np.exp(-a * (k - 1)) - (x0[0] - b / a) * np.exp(-a * (k - 2))
    delta = np.abs(x0 - np.array([f(i) for i in range(1, len(x0) + 1)]))
    C = delta.std() / x0.std()  # posterior variance ratio
    P = 1.0 * (np.abs(delta - delta.mean()) < 0.6745 * x0.std()).sum() / len(x0)  # small error probability
    return f, a, b, x0[0], C, P

new_reg_data = pd.read_csv('new_reg_data.csv', header=0, index_col=0)
data = pd.read_csv('data(1).csv', header=0)
new_reg_data.index = range(1994, 2014)
# Add empty rows for the two years to be forecast
new_reg_data.loc[2014] = None
new_reg_data.loc[2015] = None
cols = ['x1', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x13']
for i in cols:
    # Fit GM(1,1) on the 1994-2013 values of this feature
    f = GM11(new_reg_data.loc[range(1994, 2014), i].values)[0]
    new_reg_data.loc[2014, i] = f(len(new_reg_data) - 1)  # k = 21
    new_reg_data.loc[2015, i] = f(len(new_reg_data))      # k = 22
    new_reg_data[i] = new_reg_data[i].round(2)
# The target is unknown for 2014 and 2015
y = list(data['y'].values)
y.extend([np.nan, np.nan])
new_reg_data['y'] = y
# Writing .xls relies on the xlwt engine, which recent pandas releases (2.0 and later) no longer support
new_reg_data.to_excel('new_reg_data_GM11.xls')
print('Predicted values:\n', new_reg_data.loc[2014:2015, :])
result
The same loop, with intermediate values printed: for each selected feature it takes the 1994-2013 values of that column, fits the GM(1,1) model, predicts the values for 2014 and 2015, and saves them in the data table.
l = ['x1', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x13']
for i in l:
    # .as_matrix() was removed in pandas 1.0; .values returns the same NumPy array
    f = GM11(new_reg_data.loc[range(1994, 2014), i].values)[0]
    print('i:', i)
    print(new_reg_data.loc[range(1994, 2014), i])
    new_reg_data.loc[2014, i] = f(len(new_reg_data) - 1)
    print(new_reg_data.loc[2014, i])
    new_reg_data.loc[2015, i] = f(len(new_reg_data))
    print(new_reg_data.loc[2015, i])
    new_reg_data[i] = new_reg_data[i].round(2)  # keep 2 decimal places
    print("*" * 50)
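The GM11 function also returns the posterior variance ratio C and the small error probability P, but the loops above discard them. The sketch below shows one way to inspect them before trusting a forecast. It is a self-contained re-implementation (named `gm11` here, using `np.linalg.lstsq` instead of the explicit matrix inverse) run on a made-up series that grows 10% per year. The grading thresholds in `grade` are the reference table commonly quoted for grey models and are an assumption, not something stated in this article.

```python
import numpy as np


def gm11(x0):
    """GM(1,1) fit following the same steps as the GM11 function above."""
    x0 = np.asarray(x0, dtype=float)
    x1 = x0.cumsum()                                # accumulated series
    z1 = (x1[:-1] + x1[1:]) / 2.0                   # means of adjacent accumulated values
    B = np.column_stack([-z1, np.ones_like(z1)])
    a, b = np.linalg.lstsq(B, x0[1:], rcond=None)[0]
    # Restored prediction for step k (k = 1 is the first observation)
    f = lambda k: (x0[0] - b / a) * (np.exp(-a * (k - 1)) - np.exp(-a * (k - 2)))
    fitted = np.array([f(k) for k in range(1, len(x0) + 1)])
    delta = np.abs(x0 - fitted)
    C = delta.std() / x0.std()                      # posterior variance ratio
    P = (np.abs(delta - delta.mean()) < 0.6745 * x0.std()).mean()  # small error probability
    return f, a, b, C, P


def grade(C, P):
    """Commonly quoted accuracy grades for grey models (assumed thresholds)."""
    if C < 0.35 and P > 0.95:
        return "good"
    if C < 0.50 and P > 0.80:
        return "qualified"
    if C < 0.65 and P > 0.70:
        return "barely qualified"
    return "unqualified"


# Made-up series for 1994-2013 with 10% yearly growth
first_year = 1994
series = 100.0 * 1.1 ** np.arange(20)

f, a, b, C, P = gm11(series)
# Step index k for a calendar year: 1994 -> 1, ..., 2013 -> 20, 2014 -> 21
forecast = {year: f(year - first_year + 1) for year in (2014, 2015)}

print("a =", a, "b =", b)
print("C =", C, "P =", P, "grade:", grade(C, P))
print("forecast:", forecast)
```

The year-to-step mapping also explains the indices in the loops above: once the 2014 and 2015 rows have been added, the table has 22 rows, so `f(len(new_reg_data) - 1)` evaluates step 21 (2014) and `f(len(new_reg_data))` evaluates step 22 (2015).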
Building a Support Vector Machine Regression Model
Fiscal revenue forecasts for 2014 and 2015 using support vector regression models
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.svm import LinearSVR
data = pd.read_excel('new_reg_data_GM11.xls',index_col=0,header=0)
feature = ['x1', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x13']
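# Train on the observed years only (1994-2013)
data_train = data.loc[range(1994, 2014)].copy()
# Z-score standardization using the training mean and standard deviation
data_mean = data_train.mean()
data_std = data_train.std()
data_train = (data_train - data_mean) / data_std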
x_train = data_train[feature].values
y_train = data_train['y'].values
linearsvr = LinearSVR()
linearsvr.fit(x_train, y_train)
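# Standardize every year (including 2014 and 2015) with the training statistics
x = ((data[feature] - data_mean[feature]) / data_std[feature]).values
# Map the standardized predictions back to the original scale of y
data['y_pred'] = linearsvr.predict(x) * data_std['y'] + data_mean['y']
data.to_excel('new_reg_data_GM11_revenue.xls')
print('Actual and predicted values:\n', data[['y', 'y_pred']])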
fig = data[['y', 'y_pred']].plot(subplots=True, style=['b-o', 'r-*'])
plt.show()
result
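The imports at the top bring in `r2_score` for evaluating regression models, but the walkthrough never calls it. The sketch below shows how the fit on the training years could be scored, using the same standardize, fit, and rescale steps as above. The `toy` table, its two feature columns, and the `random_state` and `max_iter` settings are assumptions for illustration, since the real data file is not bundled here.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score
from sklearn.svm import LinearSVR

rng = np.random.default_rng(42)
years = range(1994, 2014)

# Synthetic stand-in for the real table: two trending features and a linear target
toy = pd.DataFrame({
    "x1": np.linspace(100, 500, 20) + rng.normal(scale=5, size=20),
    "x2": np.linspace(10, 90, 20) + rng.normal(scale=2, size=20),
}, index=years)
toy["y"] = 2.0 * toy["x1"] + 5.0 * toy["x2"] + rng.normal(scale=10, size=20)

features = ["x1", "x2"]

# Z-score standardization with the training statistics
mean, std = toy.mean(), toy.std()
scaled = (toy - mean) / std

model = LinearSVR(random_state=0, max_iter=10000)
model.fit(scaled[features].values, scaled["y"].values)

# Map predictions back to the original scale before scoring
y_pred = model.predict(scaled[features].values) * std["y"] + mean["y"]
score = r2_score(toy["y"], y_pred)
mean_abs_pct_err = float(np.mean(np.abs((toy["y"] - y_pred) / toy["y"])))

print("R^2 on the training years:", score)
print("mean absolute percentage error:", mean_abs_pct_err)
```

A score on the training years only shows how well the model reproduces known values. It says nothing about 2014 and 2015, where the inputs are themselves GM(1,1) forecasts.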