Analysis and Forecast of Influencing Factors of Fiscal Revenue

1. Project overview

  1. Data source: data collected from the Internet, hosted on a Baidu Cloud Disk. Link: https://pan.baidu.com/s/1Lmhl34BumjBloN-rhy7Yqw Extraction code: z84d

     

  2. Project background: Local fiscal revenue is the sum of all funds a government raises to perform its functions, implement public policy, and provide public goods and services. It is an important component of national fiscal revenue, yet its composition is relatively independent. How to formulate local fiscal expenditure plans, allocate local fiscal revenue rationally, promote local development, and improve citizens' income and quality of life are primary concerns for every local government, which makes local fiscal revenue forecasting essential. This case study applies data mining techniques to a city's data from 1994 to 2013, after China's fiscal system reform, to analyze its fiscal revenue and forecast revenue for the following two years, providing the government with a basis for balancing revenue and expenditure, optimizing the revenue structure, and making related decisions.
  3. The design goals are as follows: (1) analyze and identify the key attributes that affect local fiscal revenue; (2) predict the changes in fiscal revenue in 2014 and 2015.
2. Specific application of the project
  • Import the necessary packages:
    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    %matplotlib inline
    import seaborn as sns
    from scipy import stats
    from sklearn.model_selection import train_test_split
    from sklearn.model_selection import GridSearchCV, RepeatedKFold, cross_val_score, cross_val_predict, KFold
    from sklearn.metrics import make_scorer, mean_squared_error
    from sklearn.linear_model import LinearRegression, LassoCV, Ridge, ElasticNet
    from sklearn.svm import LinearSVR, SVR
    from sklearn.neighbors import KNeighborsRegressor
    # Use r2_score to evaluate regression model performance
    from sklearn.metrics import r2_score
    # Render Chinese characters correctly in plots
    plt.rcParams['font.sans-serif'] = ['SimHei']
    # Render minus signs correctly
    plt.rcParams['axes.unicode_minus'] = False

    Read the data:

    # Read the data
    data = pd.read_csv('data(1).csv')
    # Inspect the first 5 rows to get an overview of each column (feature)
    data.head()

    Feature types and non-null counts:

    data.info()

    Missing value analysis:

    data.isnull().sum()  # missing value analysis

    Duplicate value analysis:

    data.duplicated()  # duplicate value analysis

    Descriptive statistical analysis:

    data.describe()

    Plot histograms with kernel density estimates:

    for column in data.columns:
        fig, ax = plt.subplots(figsize=(6, 6))
        sns.distplot(data.loc[:, column], norm_hist=True, bins=20)
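
    Note: `sns.distplot` is deprecated in seaborn 0.11+. A minimal equivalent using the current API (assuming a recent seaborn) would be:

    # Equivalent plot with the non-deprecated API (seaborn >= 0.11)
    for column in data.columns:
        fig, ax = plt.subplots(figsize=(6, 6))
        # stat='density' plus kde=True mirrors distplot(norm_hist=True)
        sns.histplot(data[column], bins=20, stat='density', kde=True, ax=ax)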

    Correlation analysis:

    Correlation analysis measures the degree of association between two or more features. In statistics, the Pearson correlation coefficient is the most common choice: it is the simplest correlation coefficient, it measures the strength of the linear relationship between two features, and it takes values in [-1, 1].
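
    For intuition, the coefficient can be reproduced directly from its definition, r = cov(x, y) / (std(x) * std(y)). A minimal check on toy data (not from the project):

    # Pearson r from the definition vs. NumPy's built-in; the two should agree
    a = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    b = np.array([2.1, 3.9, 6.2, 8.0, 9.8])
    r = np.cov(a, b, ddof=1)[0, 1] / (a.std(ddof=1) * b.std(ddof=1))
    print(r, np.corrcoef(a, b)[0, 1])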

    corr=data.corr(method='pearson')
    corr

    Result:

    It can be seen that all variables except x11 are strongly correlated with y, and that these attributes are also correlated with one another, i.e. there is multicollinearity among them. We therefore use the Lasso feature selection model to pick features, and draw a correlation heat map to visualize the correlations.
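
    The multicollinearity can also be quantified with variance inflation factors. A short sketch, assuming `statsmodels` is installed (it is not used elsewhere in this post):

    # Hypothetical VIF check; values far above 10 indicate severe multicollinearity
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    X = data.drop(columns=['y']).assign(const=1.0)  # add an intercept column
    vif = pd.Series([variance_inflation_factor(X.values, i) for i in range(X.shape[1] - 1)],
                    index=X.columns[:-1])
    print(vif)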

  • Draw a heat map:

    # Draw the heat map
    plt.style.use('ggplot')
    sns.set_style('whitegrid')
    plt.subplots(figsize=(10, 10))
    sns.heatmap(data.corr(method='pearson'),
                cmap='Reds',
                annot=True,
                square=True,
                fmt='.2f',
                yticklabels=corr.columns,
                xticklabels=corr.columns)

    Result:

3. Data preprocessing

  • Extract key attributes using the Lasso feature selection model:

    import pandas as pd
    import numpy as np
    from sklearn.linear_model import Lasso
    data = pd.read_csv('data(1).csv', header=0)
    x, y = data.iloc[:, :-1], data.iloc[:, -1]
    lasso = Lasso(alpha=1000, random_state=1)
    lasso.fit(x, y)
    print('Lasso coefficients:', np.round(lasso.coef_, 5))
    coef = pd.DataFrame(lasso.coef_, index=x.columns)
    print('Coefficient table:\n', coef)
    # Keep only the features whose Lasso coefficient is non-zero
    mask = lasso.coef_ != 0.0
    new_reg_data = pd.concat([x.loc[:, mask], y], axis=1)
    new_reg_data.to_csv('new_reg_data.csv')

    Result:
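
    The block above fixes alpha=1000; a possible refinement (not in the original post) is to let cross-validation choose alpha via LassoCV, which was already imported earlier:

    # Sketch: pick the regularization strength by 5-fold cross-validation
    from sklearn.linear_model import LassoCV
    lasso_cv = LassoCV(alphas=np.logspace(0, 4, 50), cv=5, random_state=1)
    lasso_cv.fit(x, y)
    print('alpha chosen by CV:', lasso_cv.alpha_)
    print('coefficients:', np.round(lasso_cv.coef_, 5))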

    Build a grey forecasting model, GM(1,1), to extrapolate the selected features to 2014 and 2015:

    # Custom grey forecasting function GM(1,1)
    def GM11(x0):
        x1 = x0.cumsum()  # accumulated generating operation (AGO)
        z1 = (x1[:len(x1) - 1] + x1[1:]) / 2.0  # means of consecutive neighbours of x1
        z1 = z1.reshape((len(z1), 1))
        B = np.append(-z1, np.ones_like(z1), axis=1)
        Yn = x0[1:].reshape((len(x0) - 1, 1))
        # Least-squares estimates of the development coefficient a and grey input b
        [[a], [b]] = np.dot(np.dot(np.linalg.inv(np.dot(B.T, B)), B.T), Yn)
        # Restored-value function: prediction for period k (k = 1 is the first observation)
        f = lambda k: (x0[0] - b / a) * np.exp(-a * (k - 1)) - (x0[0] - b / a) * np.exp(-a * (k - 2))
        delta = np.abs(x0 - np.array([f(i) for i in range(1, len(x0) + 1)]))  # residuals
        C = delta.std() / x0.std()  # posterior variance ratio
        P = 1.0 * (np.abs(delta - delta.mean()) < 0.6745 * x0.std()).sum() / len(x0)  # small-error probability
        return f, a, b, x0[0], C, P
    new_reg_data = pd.read_csv('new_reg_data.csv', header=0, index_col=0)
    data = pd.read_csv('data(1).csv', header=0)
    new_reg_data.index = range(1994, 2014)
    new_reg_data.loc[2014] = None
    new_reg_data.loc[2015] = None
    cols = ['x1', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x13']
    for i in cols:
        f = GM11(new_reg_data.loc[range(1994, 2014), i].values)[0]
        new_reg_data.loc[2014, i] = f(len(new_reg_data) - 1)  # forecast for 2014
        new_reg_data.loc[2015, i] = f(len(new_reg_data))      # forecast for 2015
        new_reg_data[i] = new_reg_data[i].round(2)  # keep two decimal places
    y = list(data['y'].values)
    y.extend([np.nan, np.nan])  # y is unknown for 2014 and 2015
    new_reg_data['y'] = y
    new_reg_data.to_excel('new_reg_data_GM11.xls')
    print('Predicted values:\n', new_reg_data.loc[2014:2015, :])

    Result:
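
    A quick sanity check (not in the original post): GM(1,1) fits a pure exponential series almost exactly, so the posterior variance ratio C should be near 0 and the small-error probability P should be 1:

    # Smoke test on a synthetic exponential series
    x0 = np.array([100 * 1.1 ** k for k in range(10)])
    f, a, b, x0_first, C, P = GM11(x0)
    print('next-period forecast:', round(f(len(x0) + 1), 2))
    print('C =', round(C, 4), 'P =', P)  # expect C close to 0 and P = 1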

    A more verbose variant of the same loop follows: for each selected column it fits GM(1,1) on the 1994-2013 values, prints the series, forecasts 2014 and 2015, and writes the predictions back into the table:

    l = ['x1', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x13']
    for i in l:
        # .values replaces the long-removed DataFrame.as_matrix()
        f = GM11(new_reg_data.loc[range(1994, 2014), i].values)[0]
        print('i:', i)
        print(new_reg_data.loc[range(1994, 2014), i])
        new_reg_data.loc[2014, i] = f(len(new_reg_data) - 1)
        print(new_reg_data.loc[2014, i])
        new_reg_data.loc[2015, i] = f(len(new_reg_data))
        print(new_reg_data.loc[2015, i])
        new_reg_data[i] = new_reg_data[i].round(2)  # keep two decimal places
        print("*" * 50)

4. Building a support vector machine regression model

Forecast fiscal revenue for 2014 and 2015 with a support vector regression model:

from sklearn.svm import LinearSVR
import matplotlib.pyplot as plt
data = pd.read_excel('new_reg_data_GM11.xls', index_col=0, header=0)
feature = ['x1', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x13']
data_train = data.loc[range(1994, 2014)].copy()  # years with an observed y

# Standardize features and target with the training statistics
data_mean = data_train.mean()
data_std = data_train.std()
data_train = (data_train - data_mean) / data_std
x_train = data_train[feature].values
y_train = data_train['y'].values

linearsvr = LinearSVR()
linearsvr.fit(x_train, y_train)
# Predict on all years (including the GM(1,1)-extrapolated 2014 and 2015) and undo the scaling
x = ((data[feature] - data_mean[feature]) / data_std[feature]).values
data['y_pred'] = linearsvr.predict(x) * data_std['y'] + data_mean['y']

data.to_excel('new_reg_data_GM11_revenue.xls')
print('Actual vs. predicted values:\n', data[['y', 'y_pred']])
fig = data[['y', 'y_pred']].plot(subplots=True, style=['b-o', 'r-*'])
plt.show()

Result:
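
To quantify the in-sample fit, the metrics imported at the top can be applied to the years that have an observed y. A minimal sketch, not part of the original post:

from sklearn.metrics import r2_score, mean_squared_error

# 2014 and 2015 have no actual y, so evaluate on 1994-2013 only
observed = data.loc[data.index <= 2013]
print('R^2:', r2_score(observed['y'], observed['y_pred']))
print('MSE:', mean_squared_error(observed['y'], observed['y_pred']))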

 
