Preface
1. Using the collected data on a city's local fiscal revenue and its various revenue categories, identify the key attributes that affect local fiscal revenue.
2. Predict the 2014 and 2015 values of the selected key influencing factors.
3. Evaluate the accuracy of the model.
Reference: [Data Mining Case] Analysis and Forecasting Model of Factors Affecting Fiscal Revenue
Stand-alone, small data, pandas
1. Basic descriptive analysis of data
1.1 Import packages and read data
- Data(data.xlsx)
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import openpyxl  # Excel engine used by pd.read_excel
# Display Chinese labels correctly
mpl.rcParams['font.sans-serif'] = ['SimHei']
# Display the minus sign correctly
mpl.rcParams['axes.unicode_minus'] = False
# Disable scientific notation
pd.set_option('display.float_format', lambda x: '%.2f' % x)
# Read the data
data = pd.read_excel('./input/data.xlsx')
- Field meaning
- Number of employees in society (x1)
- Total wages of on-the-job employees (x2)
- Total retail sales of consumer goods (x3)
- Per capita disposable income of urban residents (x4)
- Per capita consumption expenditure of urban residents (x5)
- Total population at the end of the year (x6)
- Amount of fixed asset investment in the whole society (x7)
- Gross regional product (x8)
- Primary industry output value (x9)
- Taxes (x10)
- Consumer Price Index (x11)
- Ratio of output value between tertiary industry and secondary industry (x12)
- Household consumption level (x13)
- Fiscal revenue (y)
1.2 Basic situation of the data
data.shape # (20, 14)
data.info()
# Descriptive statistics
data.describe().T
# Min, max, mean and standard deviation of each variable
r = [data.min(), data.max(), data.mean(), data.std()]
r = pd.DataFrame(r, index=['Min', 'Max', 'Mean', 'STD']).T
r = np.round(r, 2)
r
1.3 Distribution of variables
from sklearn.preprocessing import MinMaxScaler
import joypy
# Min-max normalization
scaler = MinMaxScaler() # instantiate the scaler
scaler = scaler.fit(data) # fit() computes min(x) and max(x) for each column
data_scale = pd.DataFrame(scaler.transform(data)) # export the transformed values
data_scale.columns = data.columns
# Ridge plot: each continuous column becomes one "ridge"
fig, axes = joypy.joyplot(data_scale, alpha=.5, color='#FFCC99')
data_scale.plot()
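Min-max normalization maps every column to the range [0, 1] using (x - min) / (max - min). The following is a minimal NumPy sketch of that calculation on made-up numbers. The helper name `min_max_scale` and the array `demo` are illustrative assumptions, not part of the original workflow.

```python
import numpy as np

def min_max_scale(values):
    """Rescale each column of a 2-D array to the range [0, 1]."""
    values = np.asarray(values, dtype=float)
    col_min = values.min(axis=0)
    col_max = values.max(axis=0)
    span = col_max - col_min
    # Assumption: a constant column is mapped to 0 instead of dividing by zero.
    span[span == 0] = 1.0
    return (values - col_min) / span

# Made-up numbers, not the article's data.
demo = np.array([[10.0, 200.0],
                 [20.0, 400.0],
                 [40.0, 300.0]])
scaled = min_max_scale(demo)
print(scaled)
```

After scaling, all columns share one axis, which is why the ridge plot and the line plot above can show variables with very different units side by side.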
1.4 Correlation analysis
pear = np.round(data.corr(method = 'pearson'), 2)
pear
plt.figure(figsize=(12,12))
sns.heatmap(data.corr(), center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5},
            annot=True, fmt='.1f')
# Set the x-axis tick font size
plt.xticks(fontsize=15)
# Set the y-axis tick font size
plt.yticks(fontsize=15)
plt.tight_layout()
plt.savefig('a.png')
- It can be seen from the figure that the linear relationship between the consumer price index (x11) and fiscal revenue is not significant and shows a negative correlation. The remaining variables are highly positively correlated with fiscal revenue.
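Reading one row of a 14 x 14 heatmap is easy to get wrong, so it can help to pull out the target's column and sort it. The sketch below does this on synthetic data. The column names `x_up`, `x_down` and `x_noise` are invented for illustration, and with the article's data the same expression would be applied to `data`.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 20
trend = np.arange(n, dtype=float)

# Synthetic columns, not the article's data.
demo = pd.DataFrame({
    "x_up": trend + rng.normal(0, 0.5, n),      # rises with the target
    "x_down": -trend + rng.normal(0, 0.5, n),   # falls as the target rises
    "x_noise": rng.normal(0, 1, n),             # unrelated to the target
    "y": 3 * trend + rng.normal(0, 0.5, n),
})

# Pearson correlation of every feature with y, sorted from most negative to most positive.
corr_with_y = demo.corr(method="pearson")["y"].drop("y").sort_values()
print(corr_with_y.round(2))
```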
2. Data preprocessing
- Screening of variables (selection of analysis methods):
- Earlier analyses of fiscal revenue usually fitted a multiple linear regression model, estimated its coefficients by least squares, and then tested those coefficients to judge the relationship between each variable and revenue. Such results depend heavily on the data, the solution obtained is often only a local optimum, and the subsequent tests may lose their meaning.
- This case therefore uses the Adaptive-Lasso method for variable selection. The theory of Lasso is covered in the referenced book.
2.1 Lasso variable selection model
- No AdaptiveLasso function was found in scikit-learn, so Lasso is used instead.
from sklearn.linear_model import Lasso
model = Lasso(alpha=0.1, max_iter=100000)
model.fit(data.iloc[:, 0:13], data['y'])
q = model.coef_ # coefficient of each feature
q = pd.DataFrame(q, index=data.columns[:-1])
q
- A non-zero coefficient means the variable has a relatively large effect on the target variable; a zero coefficient means the variable has little effect on it.
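As far as I know scikit-learn has no built-in adaptive Lasso estimator, but the method can be approximated with the ordinary `Lasso` class. The sketch below uses the common two-step rescaling trick: get an initial estimate, scale each column by the absolute initial coefficient raised to `gamma`, fit a plain Lasso, then map the coefficients back. The helper name `adaptive_lasso`, the OLS initial estimate, the values of `alpha` and `gamma`, and the synthetic data are all assumptions for illustration, not the method used in the book.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

def adaptive_lasso(X, y, alpha=0.1, gamma=1.0):
    """Two-step adaptive Lasso via column rescaling (illustrative helper)."""
    # Step 1: initial estimate. Ordinary least squares is assumed here;
    # ridge regression is a common alternative when features are collinear.
    beta_init = LinearRegression().fit(X, y).coef_
    scale = np.abs(beta_init) ** gamma  # inverse of the per-feature penalty weight
    # Step 2: plain Lasso on the rescaled columns, then undo the rescaling.
    lasso = Lasso(alpha=alpha, max_iter=100000).fit(X * scale, y)
    return lasso.coef_ * scale

# Synthetic data: only the first two of five features drive y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

coef = adaptive_lasso(X, y)
print(np.round(coef, 3))
```

Features with a small initial coefficient receive a heavy penalty and are driven to exactly zero, while strong features are shrunk only slightly. With only 20 highly correlated rows, as in this case, the OLS starting point would be unstable, so treat this as a sketch of the idea rather than a drop-in replacement.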
- Adjust parameter values (reference: https://blog.csdn.net/weixin_43746433/article/details/100047231)
from sklearn.linear_model import Lasso
lasso = Lasso(1000) # Lasso with the penalty λ (alpha) set to 1000
lasso.fit(data.iloc[:,0:13],data['y'])
print('Coefficients:',np.round(lasso.coef_,5)) # rounded to five decimal places
# Count the non-zero coefficients
print('Number of non-zero coefficients:',np.sum(lasso.coef_ != 0))
mask = lasso.coef_ != 0 # boolean array: True where the coefficient is non-zero
print('Coefficient is non-zero:',mask)
new_reg_data = data.iloc[:,:13].iloc[:,mask] # keep the columns with non-zero coefficients
new_reg_data = pd.concat([new_reg_data,data.y],axis=1)
new_reg_data.to_excel('new_reg_data.xlsx')
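The penalty λ (`alpha` in scikit-learn) controls how many variables survive: a larger value pushes more coefficients to exactly zero. The sketch below illustrates this on synthetic data. The data, the list of `alpha` values and the variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: 6 features, of which only the first two influence y.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 6))
y = 5.0 * X[:, 0] + 1.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Smallest alpha at which every coefficient is zero (model with intercept):
# max |Xc^T yc| / n, where Xc and yc are the centered features and target.
alpha_max = np.max(np.abs((X - X.mean(axis=0)).T @ (y - y.mean()))) / len(y)

nonzero = {}
for alpha in [0.01, 0.1, 1.0, 1.01 * alpha_max]:
    coef = Lasso(alpha=alpha, max_iter=100000).fit(X, y).coef_
    nonzero[alpha] = int(np.sum(coef != 0))
    print(f"alpha={alpha:.3g}: {nonzero[alpha]} non-zero coefficients")
```

Because the useful range of `alpha` depends on the scale of the features and the target, a value such as 1000 is only meaningful for the unscaled data it was tuned on.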
- The variables kept by Lasso are those with non-zero coefficients: x1, x3, x4, x5, x6, x7, x8 and x13.
- After the variables are screened, the next step is to start modeling.
3. Establish fiscal revenue forecast model
3.1 Gray model
- Gray model learning: https://blog.csdn.net/qq_42374697/article/details/106611556
def GM11(x0): # custom grey prediction function
    import numpy as np
    x1 = x0.cumsum() # accumulated series
    z1 = (x1[:len(x1)-1] + x1[1:])/2.0 # means of adjacent accumulated values (n-1 values); works better than the raw accumulated series
    z1 = z1.reshape((len(z1),1))
    B = np.append(-z1, np.ones_like(z1), axis = 1) # B matrix
    Y = x0[1:].reshape((len(x0)-1, 1)) # Y matrix
    [[a],[u]] = np.dot(np.dot(np.linalg.inv(np.dot(B.T, B)), B.T), Y) # least-squares estimate of the parameters
    f = lambda k: (x0[0]-u/a)*np.exp(-a*(k-1))-(x0[0]-u/a)*np.exp(-a*(k-2)) # restored value at step k
    delta = np.abs(x0 - np.array([f(i) for i in range(1,len(x0)+1)])) # absolute residuals
    C = delta.std()/x0.std()
    P = 1.0*(np.abs(delta - delta.mean()) < 0.6745*x0.std()).sum()/len(x0)
    return f, a, u, x0[0], C, P # grey prediction function, a, u, first value, posterior variance ratio, small-error probability
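To see what the grey model does, it helps to run it on a series whose behaviour is known. The sketch below is a compact re-statement of the same GM(1,1) steps, applied to a made-up series that grows by 10% per step. The names `gm11_fit` and `series` are illustrative, and `np.linalg.lstsq` is used in place of the explicit normal-equation inverse, which should give the same estimate.

```python
import numpy as np

def gm11_fit(x0):
    """Fit GM(1,1); return (predict, a, u) where predict(k) is the k-th restored value."""
    x0 = np.asarray(x0, dtype=float)
    x1 = x0.cumsum()                        # accumulated series
    z1 = (x1[:-1] + x1[1:]) / 2.0           # means of adjacent accumulated values
    B = np.column_stack([-z1, np.ones_like(z1)])
    # Least-squares solution of x0(k) = -a * z1(k) + u for k = 2..n
    a, u = np.linalg.lstsq(B, x0[1:], rcond=None)[0]

    def predict(k):
        # Difference of two consecutive accumulated predictions; k = 1 is the first observation.
        return (x0[0] - u / a) * (np.exp(-a * (k - 1)) - np.exp(-a * (k - 2)))

    return predict, a, u

# Made-up series: starts at 100 and grows 10% per step (10 observations).
series = 100.0 * 1.1 ** np.arange(10)
predict, a, u = gm11_fit(series)

fitted = np.array([predict(k) for k in range(2, 11)])
rel_err = np.abs(fitted - series[1:]) / series[1:]
next_value = predict(11)                    # one step beyond the observed data
print("a =", round(float(a), 4))
print("max relative fitting error =", float(rel_err.max()))
print("forecast for step 11 =", float(next_value))
```

A negative `a` indicates a growing series. The model fits a single exponential curve, so it suits short, smooth, monotonic series such as the yearly indicators used here.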
data.index = range(1994, 2014)
data.loc[2014] = None
data.loc[2015] = None
# Evaluate model accuracy
# The 8 variables selected by Lasso
l = ['x1', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x13']
for i in l:
    GM = GM11(data.loc[1994:2013, i].values.astype(float))
    f = GM[0]
    c = GM[-2]
    p = GM[-1]
    # Assign with .loc; chained indexing (data[i][2014] = ...) may not write to data
    data.loc[2014, i] = f(len(data)-1)
    data.loc[2015, i] = f(len(data))
    data[i] = data[i].round(2)
    if (c < 0.35) and (p > 0.95):
        print('Model for {}: accuracy is good'.format(i))
    elif (c < 0.5) and (p > 0.8):
        print('Model for {}: accuracy is qualified'.format(i))
    elif (c < 0.65) and (p > 0.7):
        print('Model for {}: accuracy is barely qualified'.format(i))
    else:
        print('Model for {}: accuracy is unqualified'.format(i))
data[l+['y']].to_excel('data2.xlsx')
- The predicted values are as follows:
3.2 Neural network prediction model
- Next, we use historical data to build a neural network model.
- The parameters are set as follows: an error precision of 10^-7, 10,000 training iterations, and 8 input neurons, one for each variable selected by the Lasso variable selection method.
'''Neural network'''
data2 = pd.read_excel('data2.xlsx', index_col=0)
# Extract the data
feature = list(data2.columns[:len(data2.columns)-1]) # ['x1', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x13']
train = data2.loc[list(range(1994, 2014))].copy()
mean = train.mean()
std = train.std()
train = (train - mean) / std # z-score standardization
x_train = train[feature].values
y_train = train['y'].values
# Build the neural network model
from keras.models import Sequential
from keras.layers import Dense, Activation
model = Sequential()
model.add(Dense(input_dim=8, units=12)) # 8 inputs, 12 hidden units
model.add(Activation('relu'))
model.add(Dense(units=1)) # single output: fiscal revenue
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(x_train, y_train, epochs=10000, batch_size=16)
model.save_weights('net.weights.h5') # recent Keras versions require the .weights.h5 suffix
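- After training, feed the data for 1994-2015 into the model for prediction
# Standardize the whole feature matrix with the training mean and standard deviation
x = ((data2[feature] - mean[feature]) / std[feature]).values
# Predict, then map the result back to the original scale
data2['y_pred'] = model.predict(x).ravel() * std['y'] + mean['y']
data2.to_excel('data3.xlsx')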
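The prediction step relies on standardizing with the training mean and standard deviation and then reversing that transform on the model output. A minimal sketch of the round trip, on made-up numbers (the array `values` is an assumption for illustration):

```python
import numpy as np

# Made-up target values, not the article's data.
values = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

mean = values.mean()
std = values.std(ddof=1)       # ddof=1 matches the default of pandas' std()

z = (values - mean) / std      # standardized values, as fed to the network
restored = z * std + mean      # inverse transform, as applied to the predictions

print(z.round(3))
print(restored)
```

The statistics come from the 1994-2013 training rows only, so the 2014 and 2015 rows are scaled with the same constants and no information from the forecast years leaks into the training scale.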
- Forecast results
- Plot a line graph between true and predicted values
import matplotlib.pyplot as plt
p = data2[['y', 'y_pred']].plot(style=['b-o', 'r-*'])
p.set_ylim(0, 2500)
p.set_xlim(1993, 2016)
plt.savefig('plot.png') # save as plot.png before show(); saving afterwards can write an empty figure
plt.show()
- From the results, the predicted values are basically consistent with the actual values.
- In order to compare with the neural network prediction results, let's use other prediction models to see the results.
from sklearn.linear_model import LinearRegression # linear regression
from sklearn.neighbors import KNeighborsRegressor # k-nearest neighbors regression
from sklearn.neural_network import MLPRegressor # neural network regression
from sklearn.tree import DecisionTreeRegressor # decision tree regression
from sklearn.tree import ExtraTreeRegressor # extremely randomized tree regression
from xgboost import XGBRegressor # XGBoost
from sklearn.ensemble import RandomForestRegressor # random forest regression
from sklearn.ensemble import AdaBoostRegressor # AdaBoost ensemble
from sklearn.ensemble import GradientBoostingRegressor # gradient boosted decision trees
from sklearn.ensemble import BaggingRegressor # bagging regression
from sklearn.linear_model import ElasticNet # elastic net regression
from sklearn.metrics import explained_variance_score,\
mean_absolute_error,mean_squared_error,\
median_absolute_error,r2_score
models=[LinearRegression(),KNeighborsRegressor(),MLPRegressor(alpha=20),DecisionTreeRegressor(),ExtraTreeRegressor(),XGBRegressor(),RandomForestRegressor(),AdaBoostRegressor(),GradientBoostingRegressor(),BaggingRegressor(),ElasticNet()]
models_str=['LinearRegression','KNNRegressor','MLPRegressor','DecisionTree','ExtraTree','XGBoost','RandomForest','AdaBoost','GradientBoost','Bagging','ElasticNet']
data2 = pd.read_excel('data2.xlsx', index_col=0)
# Extract the data
feature = ['x1', 'x3', 'x4', 'x5', 'x6', 'x7', 'x8', 'x13']
train = data2.loc[list(range(1994, 2014))].copy()
mean = train.mean()
std = train.std()
train = (train - mean) / std # z-score standardization
x_train = train[feature].values
y_train = train['y'].values
# Standardize the whole feature matrix
x = ((data2[feature] - mean[feature]) / std[feature]).values
for name,model in zip(models_str,models):
    print('Training model: '+name)
    a = 'y_pred_'+ name
    data2[a] = model.fit(x_train,y_train).predict(x) * std['y'] + mean['y']
    df=data2[:-2] # drop 2014 and 2015, which have no observed y
    print('Mean absolute error:',mean_absolute_error(df['y'].values,df[a].values))
    print('Mean squared error:',mean_squared_error(df['y'],df[a]))
    print('Median absolute error:',median_absolute_error(df['y'],df[a]))
    print('Explained variance score:',explained_variance_score(df['y'],df[a]))
    print('R-squared:',r2_score(df['y'],df[a]))
    print('*-*'*15)
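The five scores printed in the loop follow simple formulas. The sketch below computes them by hand with NumPy on four made-up points, which can help when interpreting the numbers. The function name `regression_metrics` and the sample values are assumptions for illustration.

```python
import numpy as np

def regression_metrics(y_true, y_pred):
    """Compute the five scores used above from their definitions."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    return {
        "mae": np.mean(np.abs(err)),                 # mean absolute error
        "mse": np.mean(err ** 2),                    # mean squared error
        "medae": np.median(np.abs(err)),             # median absolute error
        "evs": 1 - np.var(err) / np.var(y_true),     # explained variance score
        "r2": 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2),
    }

# Made-up values, not the article's results.
metrics = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.5, 2.0, 2.5, 4.0])
for key, value in metrics.items():
    print(key, round(float(value), 4))
```

Explained variance and R-squared coincide when the errors average to zero, as they do in this example. Note that the loop above scores each model on the same 1994-2013 rows it was trained on, so flexible models such as decision trees can look near perfect without predicting future years well.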
Environment setup
- After the installation is successful, the image will pop up (otherwise it cannot be installed normally)
# Set a Chinese font
plt.rcParams['font.sans-serif'] = 'SimHei'
Pandas API on Spark (for reference)
- Use the pandas API provided by Spark for data processing
- The operations are the same as in pandas, with slight differences.
- Suitable for processing large data (≥1M)
Ten minutes to learn about the Pandas API on Spark (1)
Note that when reading an Excel file containing strings, you need to specify the schema of the Spark DataFrame to ensure that the type of each column is correct.
Scientific notation cannot be turned off.
from pyspark.sql import SparkSession
import pyspark.pandas as ps
# Create a SparkSession
spark = SparkSession.builder.getOrCreate()
# Read the Excel file with the pandas API on Spark
pandas_df = ps.read_excel('./input/data.xlsx')
pandas_df = pandas_df.astype(str)
print(pandas_df)
Official: Pandas API on Spark
Distributed + Spark for data processing
File upload to hdfs (distributed storage)
Spark data processing
from pyspark.sql import SparkSession
# Create a SparkSession
spark = SparkSession.builder.appName("ReadData").getOrCreate()
# Read the data file
data = spark.read.format("csv").option("header", "true").option("inferSchema", "true").load("input/data.csv")
# Assign the column names
columns = ["x1", "x2", "x3", "x4", "x5", "x6", "x7", "x8", "x9", "x10", "x11", "x12", "x13", "y"]
data = data.toDF(*columns)
# Show the data
data.show()
import seaborn as sns
import matplotlib.pyplot as plt
# Compute the correlation matrix (the data has no "label" column, so nothing needs to be dropped)
correlation_matrix = data.toPandas().corr()
# Draw the heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(correlation_matrix, annot=True, cmap="coolwarm")
plt.title("Correlation Matrix")
plt.show()