[Spark Course Project] Analysis and Forecasting Model of Factors Affecting Fiscal Revenue


Preface

1. Analyze the collected data on local fiscal revenue and the various revenue categories of a certain city, and identify the key attributes that affect local fiscal revenue.
2. Forecast the 2014 and 2015 values of the selected key influencing factors.
3. Evaluate the accuracy of the model.

Reference: [Data Mining Case] Analysis and Forecasting Model of Factors Affecting Fiscal Revenue


Stand-alone, small data, pandas

1. Basic descriptive analysis of data

1.1 Import packages and read data

  • Data (data.xlsx)
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import openpyxl  # engine pandas uses to read .xlsx files

# Display Chinese labels correctly
mpl.rcParams['font.sans-serif'] = ['SimHei']
# Display the minus sign correctly
mpl.rcParams['axes.unicode_minus'] = False
# Disable scientific notation
pd.set_option('display.float_format', lambda x: '%.2f' % x)

# Read the data
data = pd.read_excel('./input/data.xlsx')


  • Field meaning
    • Number of employees in society (x1)
    • Total wages of on-the-job employees (x2)
    • Total retail sales of consumer goods (x3)
    • Per capita disposable income of urban residents (x4)
    • Per capita consumption expenditure of urban residents (x5)
    • Total population at the end of the year (x6)
    • Amount of fixed asset investment in the whole society (x7)
    • Gross regional product (x8)
    • Primary industry output value (x9)
    • Taxes (x10)
    • Consumer Price Index (x11)
    • Ratio of output value between tertiary industry and secondary industry (x12)
    • Household consumption level (x13)
    • Local fiscal revenue (y)

1.2 Basic situation of the data

data.shape # (20, 14)
data.info()


# Descriptive statistics
data.describe().T


# Descriptive statistics: min, max, mean and standard deviation of each column
r = [data.min(), data.max(), data.mean(), data.std()]
r = pd.DataFrame(r, index=['Min', 'Max', 'Mean', 'STD']).T
r = np.round(r, 2)
r


1.3 Distribution of variables

from sklearn.preprocessing import MinMaxScaler

# Min-max normalization
scaler = MinMaxScaler()    # instantiate
scaler = scaler.fit(data)  # fit: computes min(x) and max(x) for each column
data_scale = pd.DataFrame(scaler.transform(data))  # export the result as a DataFrame
data_scale.columns = data.columns
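
To make the normalization step concrete, here is a minimal sketch of what MinMaxScaler computes for each column, (x - min) / (max - min), checked against scikit-learn on a small made-up array. The helper name min_max_scale and the toy values are assumptions for illustration and are not taken from data.xlsx.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler


def min_max_scale(values):
    """Scale each column to [0, 1] using (x - min) / (max - min)."""
    values = np.asarray(values, dtype=float)
    col_min = values.min(axis=0)
    col_range = values.max(axis=0) - col_min
    # Avoid division by zero: a constant column is mapped to 0.
    col_range[col_range == 0] = 1.0
    return (values - col_min) / col_range


# Hypothetical toy data: two columns on very different scales.
toy = np.array([[1.0, 200.0],
                [2.0, 400.0],
                [4.0, 800.0]])

manual = min_max_scale(toy)
library = MinMaxScaler().fit_transform(toy)
print(np.allclose(manual, library))
```

After scaling, both columns lie in [0, 1], which is why they can be drawn on a shared axis in the plots that follow.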

import joypy

# Ridgeline plot: each continuous column becomes one "ridge"
fig, axes = joypy.joyplot(data_scale, alpha=.5, color='#FFCC99')


data_scale.plot()


1.4 Correlation analysis

pear = np.round(data.corr(method = 'pearson'), 2)
pear


plt.figure(figsize=(12, 12))
sns.heatmap(data.corr(), center=0, square=True, linewidths=.5,
            cbar_kws={"shrink": .5}, annot=True, fmt='.1f')
# Font size of the x-axis tick labels
plt.xticks(fontsize=15)
# Font size of the y-axis tick labels
plt.yticks(fontsize=15)
plt.tight_layout()
plt.savefig('a.png')


  • The heatmap shows that the linear relationship between the consumer price index (x11) and fiscal revenue is weak and negative, while the remaining variables are highly positively correlated with fiscal revenue.
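
Reading individual cells off a heatmap can be tedious, so the following sketch ranks features by the absolute value of their Pearson correlation with the target. The helper rank_by_correlation and the synthetic columns (x_up, x_down, x_noise) are assumptions for illustration; with the real data one would pass the data frame loaded above.

```python
import numpy as np
import pandas as pd


def rank_by_correlation(df, target="y"):
    """Return each feature's Pearson correlation with the target,
    ordered from the strongest to the weakest absolute correlation."""
    corr = df.corr(method="pearson")[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic stand-in for the real table (hypothetical columns).
rng = np.random.default_rng(0)
t = np.arange(20, dtype=float)
toy = pd.DataFrame({
    "x_up": t + rng.normal(scale=0.5, size=20),     # rises with y
    "x_down": -t + rng.normal(scale=0.5, size=20),  # falls as y rises
    "x_noise": rng.normal(size=20),                 # unrelated to y
    "y": 3.0 * t + rng.normal(scale=0.5, size=20),
})

ranking = rank_by_correlation(toy)
print(ranking.round(2))
```

A strongly negative value, as for x_down here, is as informative as a strongly positive one; only values near zero indicate a weak linear relationship.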

2. Data preprocessing

  • Screening of variables (selection of analysis methods):
    • Earlier analyses of fiscal revenue usually fitted a multiple linear regression model, estimated its coefficients by least squares, and then tested the coefficients to examine the relationship between the variables. Such results depend heavily on the data, often yield only a locally optimal solution, and the subsequent tests may lose their meaning.
    • This case therefore uses the Adaptive-Lasso variable selection method. The theory behind Lasso is covered in the book.

2.1 Lasso variable selection model

  • AdaptiveLasso could not be found here, so Lasso is used instead.
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1, max_iter=100000)
model.fit(data.iloc[:, 0:13], data['y'])
q = model.coef_  # coefficient of each feature
q = pd.DataFrame(q, index=data.columns[:-1])
q


  • A non-zero coefficient means that Lasso keeps the variable as having an effect on the target variable; a coefficient of zero means that the variable has little effect on the target and is dropped.
  • Adjust the parameter value (reference: https://blog.csdn.net/weixin_43746433/article/details/100047231)
from sklearn.linear_model import Lasso
lasso = Lasso(alpha=1000)  # call Lasso() with λ (alpha) set to 1000
lasso.fit(data.iloc[:, 0:13], data['y'])
print('Coefficients:', np.round(lasso.coef_, 5))  # rounded to five decimal places
# Count the non-zero coefficients
print('Number of non-zero coefficients:', np.sum(lasso.coef_ != 0))
mask = lasso.coef_ != 0  # boolean array: True where the coefficient is non-zero
print('Coefficient is non-zero:', mask)

new_reg_data = data.iloc[:, :13].iloc[:, mask]  # keep the columns with non-zero coefficients
new_reg_data = pd.concat([new_reg_data, data.y], axis=1)
new_reg_data.to_excel('new_reg_data.xlsx')


  • According to the non-zero coefficients, the variables finally selected by Lasso are x1, x3, x4, x5, x6, x7, x8 and x13.
  • After the variables are screened, the next step is to start modeling.
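
Since AdaptiveLasso was not available and plain Lasso was used instead, it may help to see one common way to approximate adaptive Lasso with scikit-learn: fit an initial Ridge model, rescale each feature by the size of its initial coefficient, run an ordinary Lasso, and scale the coefficients back. This is only a sketch under stated assumptions, not the method used in the book or in the code above; the function name adaptive_lasso, its parameters gamma and ridge_alpha, and the synthetic data are all hypothetical.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge
from sklearn.preprocessing import StandardScaler


def adaptive_lasso(X, y, alpha=0.1, gamma=1.0, ridge_alpha=1.0):
    """Two-stage reweighted Lasso (one way to approximate adaptive Lasso).

    Returns coefficients on the standardized feature scale.
    """
    Xs = StandardScaler().fit_transform(X)
    yc = y - y.mean()
    # Stage 1: an initial estimate that keeps every feature.
    init = Ridge(alpha=ridge_alpha).fit(Xs, yc).coef_
    # Features with a small initial coefficient get a small scale,
    # which is equivalent to a heavier L1 penalty in stage 2.
    weights = np.abs(init) ** gamma + 1e-8
    # Stage 2: ordinary Lasso on the rescaled features.
    lasso = Lasso(alpha=alpha, max_iter=100000).fit(Xs * weights, yc)
    return lasso.coef_ * weights


# Synthetic check: only columns 0 and 2 actually drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(scale=0.1, size=200)

coef = adaptive_lasso(X, y)
selected = np.flatnonzero(coef != 0)
print("selected columns:", selected)
```

With only 20 yearly observations and strongly correlated features, the selected set can change noticeably with alpha, so the result of any such procedure should be checked against several penalty values.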

3. Establish fiscal revenue forecast model


3.1 Gray model

  • Gray model learning: https://blog.csdn.net/qq_42374697/article/details/106611556
def GM11(x0):  # custom grey prediction function
    import numpy as np
    x1 = x0.cumsum()  # accumulated (cumulative-sum) series
    z1 = (x1[:len(x1)-1] + x1[1:])/2.0  # adjacent-mean series (n-1 values); works better than using the accumulated series directly
    z1 = z1.reshape((len(z1), 1))
    B = np.append(-z1, np.ones_like(z1), axis=1)  # B matrix
    Y = x0[1:].reshape((len(x0)-1, 1))  # Y matrix
    [[a], [u]] = np.dot(np.dot(np.linalg.inv(np.dot(B.T, B)), B.T), Y)  # least-squares estimate of the parameters
    f = lambda k: (x0[0]-u/a)*np.exp(-a*(k-1))-(x0[0]-u/a)*np.exp(-a*(k-2))  # restored value
    delta = np.abs(x0 - np.array([f(i) for i in range(1, len(x0)+1)]))  # residuals
    C = delta.std()/x0.std()
    P = 1.0*(np.abs(delta - delta.mean()) < 0.6745*x0.std()).sum()/len(x0)
    return f, a, u, x0[0], C, P  # grey prediction function, a, u, first term, posterior variance ratio, small-error probability
data.index = range(1994, 2014)
data.loc[2014] = None
data.loc[2015] = None
# Model accuracy evaluation
# The 8 variables selected by Lasso
l = ['x1', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x13']
for i in l:
    GM = GM11(data.loc[1994:2013, i].values.astype(float))
    f = GM[0]
    c = GM[-2]
    p = GM[-1]
    data.loc[2014, i] = f(len(data)-1)  # k = 21 corresponds to 2014
    data.loc[2015, i] = f(len(data))    # k = 22 corresponds to 2015
    data[i] = data[i].round(2)
    if c < 0.35 and p > 0.95:
        print('Model {}: accuracy is good'.format(i))
    elif c < 0.5 and p > 0.8:
        print('Model {}: accuracy is qualified'.format(i))
    elif c < 0.65 and p > 0.7:
        print('Model {}: accuracy is barely qualified'.format(i))
    else:
        print('Model {}: accuracy is unqualified'.format(i))

data[l+['y']].to_excel('data2.xlsx')
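
For readers who want the formulas behind GM11, the following is a sketch of the GM(1,1) equations as the function above implements them, written with the same symbols as the code (x0, x1, z1, B, Y, a, u, C, P). The notation x^{(0)}, x^{(1)}, z^{(1)} is the usual grey-model convention and is an assumption about presentation, not something taken from the original post.

```latex
% Accumulated series and adjacent means (x1 and z1 in the code)
x^{(1)}(k) = \sum_{i=1}^{k} x^{(0)}(i), \qquad
z^{(1)}(k) = \tfrac{1}{2}\bigl(x^{(1)}(k-1) + x^{(1)}(k)\bigr), \quad k = 2, \dots, n

% Grey differential equation and least-squares estimate of a and u
x^{(0)}(k) + a\, z^{(1)}(k) = u, \qquad
B = \begin{pmatrix} -z^{(1)}(2) & 1 \\ \vdots & \vdots \\ -z^{(1)}(n) & 1 \end{pmatrix}, \quad
Y = \begin{pmatrix} x^{(0)}(2) \\ \vdots \\ x^{(0)}(n) \end{pmatrix}, \quad
\begin{pmatrix} a \\ u \end{pmatrix} = (B^{\top} B)^{-1} B^{\top} Y

% Restored value (the lambda f in the code)
\hat{x}^{(0)}(k) = \Bigl(x^{(0)}(1) - \frac{u}{a}\Bigr) e^{-a(k-1)}
                 - \Bigl(x^{(0)}(1) - \frac{u}{a}\Bigr) e^{-a(k-2)}

% Posterior-error check with residuals \delta(k) = |x^{(0)}(k) - \hat{x}^{(0)}(k)|
C = \frac{\operatorname{std}(\delta)}{\operatorname{std}(x^{(0)})}, \qquad
P = \frac{1}{n}\, \#\bigl\{ k : |\delta(k) - \bar{\delta}| < 0.6745\, \operatorname{std}(x^{(0)}) \bigr\}
```

The code evaluates the restored-value expression for every k from 1 to n when computing the residuals, and for k = 21 and k = 22 to obtain the 2014 and 2015 forecasts. The loop then grades each model as good when C < 0.35 and P > 0.95, qualified when C < 0.5 and P > 0.8, barely qualified when C < 0.65 and P > 0.7, and unqualified otherwise.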


  • The predicted values for 2014 and 2015 are written to data2.xlsx.

3.2 Neural network prediction model

  • Next, we use historical data to build a neural network model.
  • Its parameters are set as follows: error precision 10^-7, 10,000 training iterations, and 8 input neurons, which is the number of variables selected by the Lasso variable selection method.
# Neural network
data2 = pd.read_excel('data2.xlsx', index_col=0)
# Extract the data
feature = list(data2.columns[:len(data2.columns)-1])  # ['x1', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x13']
train = data2.loc[list(range(1994, 2014))].copy()
mean = train.mean()
std = train.std()
train = (train - mean) / std  # standardize the data (z-score)
x_train = train[feature].values
y_train = train['y'].values

# Build the neural network model
from keras.models import Sequential
from keras.layers import Dense, Activation
import tensorflow

model = Sequential()
model.add(Dense(input_dim=8, units=12))  # 8 inputs, 12 hidden units
model.add(Activation('relu'))
model.add(Dense(units=1))  # single output: standardized y
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x_train, y_train, epochs=10000, batch_size=16)
model.save_weights('net.model')  # Keras 3 expects this filename to end in '.weights.h5'


  • After training the model, feed the data for 1994-2015 into the model for prediction
# Standardize the whole feature matrix
x = ((data2[feature] - mean[feature]) / std[feature]).values
# Predict, then restore the result to the original scale
data2['y_pred'] = model.predict(x).ravel() * std['y'] + mean['y']
data2.to_excel('data3.xlsx')
  • The forecast results are written to data3.xlsx.

  • Plot a line graph between true and predicted values
import matplotlib.pyplot as plt
p = data2[['y', 'y_pred']].plot(style=['b-o', 'r-*'])
p.set_ylim(0, 2500)
p.set_xlim(1993, 2016)
plt.savefig('plot.png')  # save the figure as plot.png before plt.show(), which may clear it
plt.show()


  • From the results, the predicted values are basically consistent with the actual values.
  • In order to compare with the neural network prediction results, let's use other prediction models to see the results.
from sklearn.linear_model import LinearRegression  # linear regression
from sklearn.neighbors import KNeighborsRegressor  # k-nearest neighbors regression
from sklearn.neural_network import MLPRegressor  # neural network regression
from sklearn.tree import DecisionTreeRegressor  # decision tree regression
from sklearn.tree import ExtraTreeRegressor  # extremely randomized tree regression
from xgboost import XGBRegressor  # XGBoost
from sklearn.ensemble import RandomForestRegressor  # random forest regression
from sklearn.ensemble import AdaBoostRegressor  # AdaBoost ensemble
from sklearn.ensemble import GradientBoostingRegressor  # gradient boosting decision trees
from sklearn.ensemble import BaggingRegressor  # bagging regression
from sklearn.linear_model import ElasticNet

from sklearn.metrics import explained_variance_score,\
mean_absolute_error,mean_squared_error,\
median_absolute_error,r2_score

models=[LinearRegression(),KNeighborsRegressor(),MLPRegressor(alpha=20),DecisionTreeRegressor(),ExtraTreeRegressor(),XGBRegressor(),RandomForestRegressor(),AdaBoostRegressor(),GradientBoostingRegressor(),BaggingRegressor(),ElasticNet()]
models_str=['LinearRegression','KNNRegressor','MLPRegressor','DecisionTree','ExtraTree','XGBoost','RandomForest','AdaBoost','GradientBoost','Bagging','ElasticNet']


data2 = pd.read_excel('data2.xlsx', index_col=0)
# Extract the data
feature = ['x1', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x13']
train = data2.loc[list(range(1994, 2014))].copy()
mean = train.mean()
std = train.std()
train = (train - mean) / std  # standardize the data (z-score)
x_train = train[feature].values
y_train = train['y'].values
# Standardize the whole feature matrix
x = ((data2[feature] - mean[feature]) / std[feature]).values


for name, model in zip(models_str, models):
    print('Training model: ' + name)
    a = 'y_pred_' + name
    data2[a] = model.fit(x_train, y_train).predict(x) * std['y'] + mean['y']
    df = data2[:-2]  # drop 2014 and 2015, which have no true y
    print('Mean absolute error:', mean_absolute_error(df['y'].values, df[a].values))
    print('Mean squared error:', mean_squared_error(df['y'], df[a]))
    print('Median absolute error:', median_absolute_error(df['y'], df[a]))
    print('Explained variance score:', explained_variance_score(df['y'], df[a]))
    print('R² score:', r2_score(df['y'], df[a]))
    print('*-*'*15)
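
To show what the printed metrics measure, here is a small sketch that computes the mean absolute error, the mean squared error and R² by hand with NumPy and compares them with scikit-learn. The function regression_report and the four sample values are hypothetical and are not taken from the fiscal revenue data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score


def regression_report(y_true, y_pred):
    """Compute MAE, MSE and R² directly from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    mae = np.mean(np.abs(residual))
    mse = np.mean(residual ** 2)
    # R² = 1 - (residual sum of squares) / (total sum of squares)
    r2 = 1.0 - np.sum(residual ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"mae": mae, "mse": mse, "r2": r2}


# Hypothetical values for illustration only.
y_true = [100.0, 150.0, 210.0, 300.0]
y_pred = [110.0, 140.0, 200.0, 310.0]

report = regression_report(y_true, y_pred)
same_as_sklearn = (
    np.isclose(report["mae"], mean_absolute_error(y_true, y_pred))
    and np.isclose(report["mse"], mean_squared_error(y_true, y_pred))
    and np.isclose(report["r2"], r2_score(y_true, y_pred))
)
print(report, same_as_sklearn)
```

Note that the loop above scores each model on the same 1994-2013 rows it was trained on, so these numbers describe goodness of fit rather than out-of-sample accuracy.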



Environment setup

Insert image description here

  • After the installation is successful, the image will pop up (otherwise it cannot be installed normally)

Insert image description here

# Set a font that can display Chinese characters
plt.rcParams['font.sans-serif'] = 'SimHei'

Pandas API on Spark (for reference)

  • Use the pandas API provided by Spark for data processing
  • The operations are the same as in pandas, with slight differences.
  • Suitable for processing large data (≥1M)

Ten minutes to learn about the Pandas API on Spark (1)
Note that when reading an Excel file containing strings, you need to specify the schema of the Spark DataFrame to ensure that the type of each column is correct.
Scientific notation cannot be turned off.

from pyspark.sql import SparkSession
import pyspark.pandas as ps

# Create the SparkSession
spark = SparkSession.builder.getOrCreate()

# Read the Excel file with the pandas API on Spark
pandas_df = ps.read_excel('./input/data.xlsx')
pandas_df = pandas_df.astype(str)

print(pandas_df)


Official: Pandas API on Spark

Distributed + Spark for data processing

File upload to hdfs (distributed storage)

Spark data processing

from pyspark.sql import SparkSession

# Create the SparkSession
spark = SparkSession.builder.appName("ReadData").getOrCreate()
# Read the data file
data = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("input/data.csv")
# Assign the column names
columns = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10", "x11", "x12", "x13", "y"]
data = data.toDF(*columns)
# Show the data
data.show()


import seaborn as sns
import matplotlib.pyplot as plt

# Compute the correlation matrix (the data is small enough to collect into pandas)
correlation_matrix = data.toPandas().corr()

# Draw the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()



Origin blog.csdn.net/Lenhart001/article/details/131452379